linux/drivers/net/ethernet/intel/ice/ice_txrx_lib.c

562 lines
14 KiB
C
Raw Normal View History

// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2019, Intel Corporation. */
#include <linux/filter.h>
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
#include <linux/net/intel/libie/rx.h>
#include "ice_txrx_lib.h"
#include "ice_eswitch.h"
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
#include "ice_lib.h"
/**
* ice_release_rx_desc - Store the new tail and head values
* @rx_ring: ring to bump
* @val: new head index
*/
void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val)
{
u16 prev_ntu = rx_ring->next_to_use & ~0x7;
rx_ring->next_to_use = val;
/* update next to alloc since we have filled the ring */
rx_ring->next_to_alloc = val;
/* QRX_TAIL will be updated with any tail value, but hardware ignores
* the lower 3 bits. This makes it so we only bump tail on meaningful
* boundaries. Also, this allows us to bump tail on intervals of 8 up to
* the budget depending on the current traffic load.
*/
val &= ~0x7;
if (prev_ntu != val) {
/* Force memory writes to complete before letting h/w
* know there are new descriptors to fetch. (Only
* applicable for weak-ordered memory model archs,
* such as IA-64).
*/
wmb();
writel(val, rx_ring->tail);
}
}
/**
* ice_get_rx_hash - get RX hash value from descriptor
* @rx_desc: specific descriptor
*
* Returns hash, if present, 0 otherwise.
*/
static u32 ice_get_rx_hash(const union ice_32b_rx_flex_desc *rx_desc)
{
const struct ice_32b_rx_flex_desc_nic *nic_mdid;
if (unlikely(rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC))
return 0;
nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
return le32_to_cpu(nic_mdid->rss_hash);
}
/**
* ice_rx_hash_to_skb - set the hash value in the skb
* @rx_ring: descriptor ring
* @rx_desc: specific descriptor
* @skb: pointer to current skb
* @rx_ptype: the ptype value from the descriptor
*/
static void
ice_rx_hash_to_skb(const struct ice_rx_ring *rx_ring,
const union ice_32b_rx_flex_desc *rx_desc,
struct sk_buff *skb, u16 rx_ptype)
{
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
struct libeth_rx_pt decoded;
u32 hash;
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
decoded = libie_rx_pt_parse(rx_ptype);
if (!libeth_rx_pt_has_hash(rx_ring->netdev, decoded))
return;
hash = ice_get_rx_hash(rx_desc);
if (likely(hash))
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
libeth_rx_pt_set_hash(skb, hash, decoded);
}
/**
* ice_rx_csum - Indicate in skb if checksum is good
* @ring: the ring we care about
* @skb: skb currently being received and modified
* @rx_desc: the receive descriptor
* @ptype: the packet type decoded by hardware
*
* skb->protocol must be set before this function is called
*/
static void
ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
{
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
struct libeth_rx_pt decoded;
u16 rx_status0, rx_status1;
bool ipv4, ipv6;
/* Start with CHECKSUM_NONE and by default csum_level = 0 */
skb->ip_summed = CHECKSUM_NONE;
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
decoded = libie_rx_pt_parse(ptype);
if (!libeth_rx_pt_has_checksum(ring->netdev, decoded))
return;
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
rx_status0 = le16_to_cpu(rx_desc->wb.status_error0);
rx_status1 = le16_to_cpu(rx_desc->wb.status_error1);
/* check if HW has decoded the packet and checksum */
if (!(rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_L3L4P_S)))
return;
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
ipv4 = libeth_rx_pt_get_ip_ver(decoded) == LIBETH_RX_PT_OUTER_IPV4;
ipv6 = libeth_rx_pt_get_ip_ver(decoded) == LIBETH_RX_PT_OUTER_IPV6;
if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EIPE_S)))) {
ring->vsi->back->hw_rx_eipe_error++;
return;
}
if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_IPE_S))))
goto checksum_fail;
if (ipv6 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_IPV6EXADD_S))))
goto checksum_fail;
/* check for L4 errors and handle packets that were not able to be
* checksummed due to arrival speed
*/
if (rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_L4E_S))
goto checksum_fail;
/* check for outer UDP checksum error in tunneled packets */
if ((rx_status1 & BIT(ICE_RX_FLEX_DESC_STATUS1_NAT_S)) &&
(rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EUDPE_S)))
goto checksum_fail;
/* If there is an outer header present that might contain a checksum
* we need to bump the checksum level by 1 to reflect the fact that
* we are indicating we validated the inner checksum.
*/
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
if (decoded.tunnel_type >= LIBETH_RX_PT_TUNNEL_IP_GRENAT)
skb->csum_level = 1;
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
skb->ip_summed = CHECKSUM_UNNECESSARY;
return;
checksum_fail:
ring->vsi->back->hw_csum_rx_error++;
}
/**
* ice_ptp_rx_hwts_to_skb - Put RX timestamp into skb
* @rx_ring: Ring to get the VSI info
* @rx_desc: Receive descriptor
* @skb: Particular skb to send timestamp with
*
* The timestamp is in ns, so we must convert the result first.
*/
static void
ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
const union ice_32b_rx_flex_desc *rx_desc,
struct sk_buff *skb)
{
u64 ts_ns = ice_ptp_get_rx_hwts(rx_desc, &rx_ring->pkt_ctx);
skb_hwtstamps(skb)->hwtstamp = ns_to_ktime(ts_ns);
}
/**
* ice_get_ptype - Read HW packet type from the descriptor
* @rx_desc: RX descriptor
*/
static u16 ice_get_ptype(const union ice_32b_rx_flex_desc *rx_desc)
{
return le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
ICE_RX_FLEX_DESC_PTYPE_M;
}
/**
* ice_process_skb_fields - Populate skb header fields from Rx descriptor
* @rx_ring: Rx descriptor ring packet is being transacted on
* @rx_desc: pointer to the EOP Rx descriptor
* @skb: pointer to current skb being populated
*
* This function checks the ring, descriptor, and packet information in
* order to populate the hash, checksum, VLAN, protocol, and
* other fields within the skb.
*/
void
ice_process_skb_fields(struct ice_rx_ring *rx_ring,
union ice_32b_rx_flex_desc *rx_desc,
struct sk_buff *skb)
{
u16 ptype = ice_get_ptype(rx_desc);
ice_rx_hash_to_skb(rx_ring, rx_desc, skb, ptype);
/* modifies the skb - consumes the enet header */
if (unlikely(rx_ring->flags & ICE_RX_FLAGS_MULTIDEV)) {
struct net_device *netdev = ice_eswitch_get_target(rx_ring,
rx_desc);
if (ice_is_port_repr_netdev(netdev))
ice_repr_inc_rx_stats(netdev, skb->len);
skb->protocol = eth_type_trans(skb, netdev);
} else {
skb->protocol = eth_type_trans(skb, rx_ring->netdev);
}
ice_rx_csum(rx_ring, skb, rx_desc, ptype);
ice: enable receive hardware timestamping Add SIOCGHWTSTAMP and SIOCSHWTSTAMP ioctl handlers to respond to requests to enable timestamping support. If the request is for enabling Rx timestamps, set a bit in the Rx descriptors to indicate that receive timestamps should be reported. Hardware captures receive timestamps in the PHY which only captures part of the timer, and reports only 40 bits into the Rx descriptor. The upper 32 bits represent the contents of GLTSYN_TIME_L at the point of packet reception, while the lower 8 bits represent the upper 8 bits of GLTSYN_TIME_0. The networking and PTP stack expect 64 bit timestamps in nanoseconds. To support this, implement some logic to extend the timestamps by using the full PHC time. If the Rx timestamp was captured prior to the PHC time, then the real timestamp is PHC - (lower_32_bits(PHC) - timestamp) If the Rx timestamp was captured after the PHC time, then the real timestamp is PHC + (timestamp - lower_32_bits(PHC)) These calculations are correct as long as neither the PHC timestamp nor the Rx timestamps are more than 2^32-1 nanseconds old. Further, we can detect when the Rx timestamp is before or after the PHC as long as the PHC timestamp is no more than 2^31-1 nanoseconds old. In that case, we calculate the delta between the lower 32 bits of the PHC and the Rx timestamp. If it's larger than 2^31-1 then the Rx timestamp must have been captured in the past. If it's smaller, then the Rx timestamp must have been captured after PHC time. Add an ice_ptp_extend_32b_ts function that relies on a cached copy of the PHC time and implements this algorithm to calculate the proper upper 32bits of the Rx timestamps. Cache the PHC time periodically in all of the Rx rings. This enables each Rx ring to simply call the extension function with a recent copy of the PHC time. By ensuring that the PHC time is kept up to date periodically, we ensure this algorithm doesn't use stale data and produce incorrect results. To cache the time, introduce a kworker and a kwork item to periodically store the Rx time. It might seem like we should use the .do_aux_work interface of the PTP clock. This doesn't work because all PFs must cache this time, but only one PF owns the PTP clock device. Thus, the ice driver will manage its own kthread instead of relying on the PTP do_aux_work handler. With this change, the driver can now report Rx timestamps on all incoming packets. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-09 09:39:52 -07:00
if (rx_ring->ptp_rx)
ice_ptp_rx_hwts_to_skb(rx_ring, rx_desc, skb);
}
/**
* ice_receive_skb - Send a completed packet up the stack
* @rx_ring: Rx ring in play
* @skb: packet to send up
* @vlan_tci: VLAN TCI for packet
*
* This function sends the completed packet (via. skb) up the stack using
* gro receive functions (with/without VLAN tag)
*/
void
ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tci)
{
if ((vlan_tci & VLAN_VID_MASK) && rx_ring->vlan_proto)
__vlan_hwaccel_put_tag(skb, rx_ring->vlan_proto,
vlan_tci);
ice: Add hot path support for 802.1Q and 802.1ad VLAN offloads Currently the driver only supports 802.1Q VLAN insertion and stripping. However, once Double VLAN Mode (DVM) is fully supported, then both 802.1Q and 802.1ad VLAN insertion and stripping will be supported. Unfortunately the VSI context parameters only allow for one VLAN ethertype at a time for VLAN offloads so only one or the other VLAN ethertype offload can be supported at once. To support this, multiple changes are needed. Rx path changes: [1] In DVM, the Rx queue context l2tagsel field needs to be cleared so the outermost tag shows up in the l2tag2_2nd field of the Rx flex descriptor. In Single VLAN Mode (SVM), the l2tagsel field should remain 1 to support SVM configurations. [2] Modify the ice_test_staterr() function to take a __le16 instead of the ice_32b_rx_flex_desc union pointer so this function can be used for both rx_desc->wb.status_error0 and rx_desc->wb.status_error1. [3] Add the new inline function ice_get_vlan_tag_from_rx_desc() that checks if there is a VLAN tag in l2tag1 or l2tag2_2nd. [4] In ice_receive_skb(), add a check to see if NETIF_F_HW_VLAN_STAG_RX is enabled in netdev->features. If it is, then this is the VLAN ethertype that needs to be added to the stripping VLAN tag. Since ice_fix_features() prevents CTAG_RX and STAG_RX from being enabled simultaneously, the VLAN ethertype will only ever be 802.1Q or 802.1ad. Tx path changes: [1] In DVM, the VLAN tag needs to be placed in the l2tag2 field of the Tx context descriptor. The new define ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN was added to the list of tx_flags to handle this case. [2] When the stack requests the VLAN tag to be offloaded on Tx, the driver needs to set either ICE_TX_FLAGS_HW_OUTER_SINGLE_VLAN or ICE_TX_FLAGS_HW_VLAN, so the tag is inserted in l2tag2 or l2tag1 respectively. To determine which location to use, set a bit in the Tx ring flags field during ring allocation that can be used to determine which field to use in the Tx descriptor. In DVM, always use l2tag2, and in SVM, always use l2tag1. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-12-02 08:38:47 -08:00
napi_gro_receive(&rx_ring->q_vector->napi, skb);
}
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/**
* ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
* @dev: device for DMA mapping
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
* @tx_buf: Tx buffer to clean
* @bq: XDP bulk flush struct
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
*/
static void
ice_clean_xdp_tx_buf(struct device *dev, struct ice_tx_buf *tx_buf,
struct xdp_frame_bulk *bq)
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
{
dma_unmap_single(dev, dma_unmap_addr(tx_buf, dma),
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
dma_unmap_len_set(tx_buf, len, 0);
ice: Robustify cleaning/completing XDP Tx buffers When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and raises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than to some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-10 18:06:16 +01:00
switch (tx_buf->type) {
case ICE_TX_BUF_XDP_TX:
page_frag_free(tx_buf->raw_buf);
break;
case ICE_TX_BUF_XDP_XMIT:
xdp_return_frame_bulk(tx_buf->xdpf, bq);
break;
ice: Robustify cleaning/completing XDP Tx buffers When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and raises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than to some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-10 18:06:16 +01:00
}
tx_buf->type = ICE_TX_BUF_EMPTY;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
}
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
/**
* ice_clean_xdp_irq - Reclaim resources after transmit completes on XDP ring
* @xdp_ring: XDP ring to clean
*/
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
static u32 ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
{
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
int total_bytes = 0, total_pkts = 0;
struct device *dev = xdp_ring->dev;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
u32 ntc = xdp_ring->next_to_clean;
struct ice_tx_desc *tx_desc;
u32 cnt = xdp_ring->count;
struct xdp_frame_bulk bq;
u32 frags, xdp_tx = 0;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
u32 ready_frames = 0;
u32 idx;
u32 ret;
idx = xdp_ring->tx_buf[ntc].rs_idx;
tx_desc = ICE_TX_DESC(xdp_ring, idx);
if (tx_desc->cmd_type_offset_bsz &
cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)) {
if (idx >= ntc)
ready_frames = idx - ntc + 1;
else
ready_frames = idx + cnt - ntc + 1;
}
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
ice: Fix XDP Tx ring overrun Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. ATST, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-10 18:06:14 +01:00
if (unlikely(!ready_frames))
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
return 0;
ret = ready_frames;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
xdp_frame_bulk_init(&bq);
rcu_read_lock(); /* xdp_return_frame_bulk() */
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
while (ready_frames) {
struct ice_tx_buf *tx_buf = &xdp_ring->tx_buf[ntc];
ice: Robustify cleaning/completing XDP Tx buffers When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and raises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than to some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-10 18:06:16 +01:00
struct ice_tx_buf *head = tx_buf;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/* bytecount holds size of head + frags */
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
total_bytes += tx_buf->bytecount;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
frags = tx_buf->nr_frags;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
total_pkts++;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/* count head + frags */
ready_frames -= frags + 1;
xdp_tx++;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
ntc++;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
if (ntc == cnt)
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
ntc = 0;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
for (int i = 0; i < frags; i++) {
tx_buf = &xdp_ring->tx_buf[ntc];
ice_clean_xdp_tx_buf(dev, tx_buf, &bq);
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
ntc++;
if (ntc == cnt)
ntc = 0;
}
ice: Robustify cleaning/completing XDP Tx buffers When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and raises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than to some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-10 18:06:16 +01:00
ice_clean_xdp_tx_buf(dev, head, &bq);
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
}
xdp_flush_frame_bulk(&bq);
rcu_read_unlock();
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
tx_desc->cmd_type_offset_bsz = 0;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
xdp_ring->next_to_clean = ntc;
xdp_ring->xdp_tx_active -= xdp_tx;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
return ret;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
}
/**
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
* __ice_xmit_xdp_ring - submit frame to XDP ring for transmission
* @xdp: XDP buffer to be placed onto Tx descriptors
* @xdp_ring: XDP ring for transmission
* @frame: whether this comes from .ndo_xdp_xmit()
*/
int __ice_xmit_xdp_ring(struct xdp_buff *xdp, struct ice_tx_ring *xdp_ring,
bool frame)
{
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
struct skb_shared_info *sinfo = NULL;
u32 size = xdp->data_end - xdp->data;
struct device *dev = xdp_ring->dev;
u32 ntu = xdp_ring->next_to_use;
struct ice_tx_desc *tx_desc;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
struct ice_tx_buf *tx_head;
struct ice_tx_buf *tx_buf;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
u32 cnt = xdp_ring->count;
void *data = xdp->data;
u32 nr_frags = 0;
u32 free_space;
u32 frag = 0;
free_space = ICE_DESC_UNUSED(xdp_ring);
ice: Fix XDP Tx ring overrun Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. ATST, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-10 18:06:14 +01:00
if (free_space < ICE_RING_QUARTER(xdp_ring))
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
free_space += ice_clean_xdp_irq(xdp_ring);
ice: Fix XDP Tx ring overrun Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. ATST, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-10 18:06:14 +01:00
if (unlikely(!free_space))
goto busy;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
if (unlikely(xdp_buff_has_frags(xdp))) {
sinfo = xdp_get_shared_info_from_buff(xdp);
nr_frags = sinfo->nr_frags;
ice: Fix XDP Tx ring overrun Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. ATST, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-10 18:06:14 +01:00
if (free_space < nr_frags + 1)
goto busy;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
}
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
tx_desc = ICE_TX_DESC(xdp_ring, ntu);
tx_head = &xdp_ring->tx_buf[ntu];
tx_buf = tx_head;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
for (;;) {
dma_addr_t dma;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
dma = dma_map_single(dev, data, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma))
goto dma_unmap;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/* record length, and DMA address */
dma_unmap_len_set(tx_buf, len, size);
dma_unmap_addr_set(tx_buf, dma, dma);
if (frame) {
tx_buf->type = ICE_TX_BUF_FRAG;
} else {
tx_buf->type = ICE_TX_BUF_XDP_TX;
tx_buf->raw_buf = data;
}
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
tx_desc->buf_addr = cpu_to_le64(dma);
tx_desc->cmd_type_offset_bsz = ice_build_ctob(0, 0, size, 0);
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
ntu++;
if (ntu == cnt)
ntu = 0;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
if (frag == nr_frags)
break;
tx_desc = ICE_TX_DESC(xdp_ring, ntu);
tx_buf = &xdp_ring->tx_buf[ntu];
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
data = skb_frag_address(&sinfo->frags[frag]);
size = skb_frag_size(&sinfo->frags[frag]);
frag++;
ice: optimize XDP_TX workloads Optimize Tx descriptor cleaning for XDP. Current approach doesn't really scale and chokes when multiple flows are handled. Introduce two ring fields, @next_dd and @next_rs that will keep track of descriptor that should be looked at when the need for cleaning arise and the descriptor that should have the RS bit set, respectively. Note that at this point the threshold is a constant (32), but it is something that we could make configurable. First thing is to get away from setting RS bit on each descriptor. Let's do this only once NTU is higher than the currently @next_rs value. In such case, grab the tx_desc[next_rs], set the RS bit in descriptor and advance the @next_rs by a 32. Second thing is to clean the Tx ring only when there are less than 32 free entries. For that case, look up the tx_desc[next_dd] for a DD bit. This bit is written back by HW to let the driver know that xmit was successful. It will happen only for those descriptors that had RS bit set. Clean only 32 descriptors and advance the DD bit. Actual cleaning routine is moved from ice_napi_poll() down to the ice_xmit_xdp_ring(). It is safe to do so as XDP ring will not get any SKBs in there that would rely on interrupts for the cleaning. Nice side effect is that for rare case of Tx fallback path (that next patch is going to introduce) we don't have to trigger the SW irq to clean the ring. With those two concepts, ring is kept at being almost full, but it is guaranteed that driver will be able to produce Tx descriptors. This approach seems to work out well even though the Tx descriptors are produced in one-by-one manner. Test was conducted with the ice HW bombarded with packets from HW generator, configured to generate 30 flows. Xdp2 sample yields the following results: <snip> proto 17: 79973066 pkt/s proto 17: 80018911 pkt/s proto 17: 80004654 pkt/s proto 17: 79992395 pkt/s proto 17: 79975162 pkt/s proto 17: 79955054 pkt/s proto 17: 79869168 pkt/s proto 17: 79823947 pkt/s proto 17: 79636971 pkt/s </snip> As that sample reports the Rx'ed frames, let's look at sar output. It says that what we Rx'ed we do actually Tx, no noticeable drops. Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 79842324.00 79842310.40 4678261.17 4678260.38 0.00 0.00 0.00 38.32 with tx_busy staying calm. When compared to a state before: Average: IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s %ifutil Average: ens4f1 90919711.60 42233822.60 5327326.85 2474638.04 0.00 0.00 0.00 43.64 it can be observed that the amount of txpck/s is almost doubled, meaning that the performance is improved by around 90%. All of this due to the drops in the driver, previously the tx_busy stat was bumped at a 7mpps rate. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:02 +02:00
}
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/* store info about bytecount and frag count in first desc */
tx_head->bytecount = xdp_get_buff_len(xdp);
tx_head->nr_frags = nr_frags;
if (frame) {
tx_head->type = ICE_TX_BUF_XDP_XMIT;
tx_head->xdpf = xdp->data_hard_start;
}
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/* update last descriptor from a frame with EOP */
tx_desc->cmd_type_offset_bsz |=
cpu_to_le64(ICE_TX_DESC_CMD_EOP << ICE_TXD_QW1_CMD_S);
xdp_ring->xdp_tx_active++;
xdp_ring->next_to_use = ntu;
return ICE_XDP_TX;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
dma_unmap:
for (;;) {
tx_buf = &xdp_ring->tx_buf[ntu];
dma_unmap_page(dev, dma_unmap_addr(tx_buf, dma),
dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
dma_unmap_len_set(tx_buf, len, 0);
if (tx_buf == tx_head)
break;
if (!ntu)
ntu += cnt;
ntu--;
}
return ICE_XDP_CONSUMED;
ice: Fix XDP Tx ring overrun Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. ATST, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-10 18:06:14 +01:00
busy:
xdp_ring->ring_stats->tx_stats.tx_busy++;
return ICE_XDP_CONSUMED;
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
}
/**
* ice_finalize_xdp_rx - Bump XDP Tx tail and/or flush redirect map
* @xdp_ring: XDP ring
* @xdp_res: Result of the receive batch
* @first_idx: index to write from caller
*
* This function bumps XDP Tx tail and/or flush redirect map, and
* should be called when a batch of packets has been processed in the
* napi loop.
*/
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
u32 first_idx)
{
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
struct ice_tx_buf *tx_buf = &xdp_ring->tx_buf[first_idx];
if (xdp_res & ICE_XDP_REDIR)
net: Tree wide: Replace xdp_do_flush_map() with xdp_do_flush(). xdp_do_flush_map() is deprecated and new code should use xdp_do_flush() instead. Replace xdp_do_flush_map() with xdp_do_flush(). Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: Clark Wang <xiaoning.wang@nxp.com> Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: David Arinzon <darinzon@amazon.com> Cc: Edward Cree <ecree.xilinx@gmail.com> Cc: Felix Fietkau <nbd@nbd.name> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Jassi Brar <jaswinder.singh@linaro.org> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: John Crispin <john@phrozen.org> Cc: Leon Romanovsky <leon@kernel.org> Cc: Lorenzo Bianconi <lorenzo@kernel.org> Cc: Louis Peens <louis.peens@corigine.com> Cc: Marcin Wojtas <mw@semihalf.com> Cc: Mark Lee <Mark-MC.Lee@mediatek.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: NXP Linux Team <linux-imx@nxp.com> Cc: Noam Dagan <ndagan@amazon.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Saeed Bishara <saeedb@amazon.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Sean Wang <sean.wang@mediatek.com> Cc: Shay Agroskin <shayagr@amazon.com> Cc: Shenwei Wang <shenwei.wang@nxp.com> Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com> Cc: Tony Nguyen <anthony.l.nguyen@intel.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Cc: Wei Fang <wei.fang@nxp.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Arthur Kiyanovski <akiyano@amazon.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/20230908143215.869913-2-bigeasy@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-09-08 16:32:14 +02:00
xdp_do_flush();
ice: introduce XDP_TX fallback path Under rare circumstances there might be a situation where a requirement of having XDP Tx queue per CPU could not be fulfilled and some of the Tx resources have to be shared between CPUs. This yields a need for placing accesses to xdp_ring inside a critical section protected by spinlock. These accesses happen to be in the hot path, so let's introduce the static branch that will be triggered from the control plane when driver could not provide Tx queue dedicated for XDP on each CPU. Currently, the design that has been picked is to allow any number of XDP Tx queues that is at least half of a count of CPUs that platform has. For lower number driver will bail out with a response to user that there were not enough Tx resources that would allow configuring XDP. The sharing of rings is signalled via static branch enablement which in turn indicates that lock for xdp_ring accesses needs to be taken in hot path. Approach based on static branch has no impact on performance of a non-fallback path. One thing that is needed to be mentioned is a fact that the static branch will act as a global driver switch, meaning that if one PF got out of Tx resources, then other PFs that ice driver is servicing will suffer. However, given the fact that HW that ice driver is handling has 1024 Tx queues per each PF, this is currently an unlikely scenario. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:03 +02:00
if (xdp_res & ICE_XDP_TX) {
if (static_branch_unlikely(&ice_xdp_locking_key))
spin_lock(&xdp_ring->tx_lock);
ice: Add support for XDP multi-buffer on Tx side Similarly as for Rx side in previous patch, logic on XDP Tx in ice driver needs to be adjusted for multi-buffer support. Specifically, the way how HW Tx descriptors are produced and cleaned. Currently, XDP_TX works on strict ring boundaries, meaning it sets RS bit (on producer side) / looks up DD bit (on consumer/cleaning side) every quarter of the ring. It means that if for example multi buffer frame would span across the ring quarter boundary (say that frame consists of 4 frames and we start from 62 descriptor where ring is sized to 256 entries), RS bit would be produced in the middle of multi buffer frame, which would be a broken behavior as it needs to be set on the last descriptor of the frame. To make it work, set RS bit at the last descriptor from the batch of frames that XDP_TX action was used on and make the first entry remember the index of last descriptor with RS bit set. This way, cleaning side can take the index of descriptor with RS bit, look up DD bit's presence and clean from first entry to last. In order to clean up the code base introduce the common ice_set_rs_bit() which will return index of descriptor that got RS bit produced on so that standard driver can store this within proper ice_tx_buf and ZC driver can simply ignore return value. Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Link: https://lore.kernel.org/bpf/20230131204506.219292-12-maciej.fijalkowski@intel.com
2023-01-31 21:45:04 +01:00
/* store index of descriptor with RS bit set in the first
* ice_tx_buf of given NAPI batch
*/
tx_buf->rs_idx = ice_set_rs_bit(xdp_ring);
ice_xdp_ring_update_tail(xdp_ring);
ice: introduce XDP_TX fallback path Under rare circumstances there might be a situation where a requirement of having XDP Tx queue per CPU could not be fulfilled and some of the Tx resources have to be shared between CPUs. This yields a need for placing accesses to xdp_ring inside a critical section protected by spinlock. These accesses happen to be in the hot path, so let's introduce the static branch that will be triggered from the control plane when driver could not provide Tx queue dedicated for XDP on each CPU. Currently, the design that has been picked is to allow any number of XDP Tx queues that is at least half of a count of CPUs that platform has. For lower number driver will bail out with a response to user that there were not enough Tx resources that would allow configuring XDP. The sharing of rings is signalled via static branch enablement which in turn indicates that lock for xdp_ring accesses needs to be taken in hot path. Approach based on static branch has no impact on performance of a non-fallback path. One thing that is needed to be mentioned is a fact that the static branch will act as a global driver switch, meaning that if one PF got out of Tx resources, then other PFs that ice driver is servicing will suffer. However, given the fact that HW that ice driver is handling has 1024 Tx queues per each PF, this is currently an unlikely scenario. Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-08-19 14:00:03 +02:00
if (static_branch_unlikely(&ice_xdp_locking_key))
spin_unlock(&xdp_ring->tx_lock);
}
}
/**
* ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
* @ctx: XDP buff pointer
* @ts_ns: destination address
*
* Copy HW timestamp (if available) to the destination address.
*/
static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
{
const struct ice_xdp_buff *xdp_ext = (void *)ctx;
*ts_ns = ice_ptp_get_rx_hwts(xdp_ext->eop_desc,
xdp_ext->pkt_ctx);
if (!*ts_ns)
return -ENODATA;
return 0;
}
/**
* ice_xdp_rx_hash_type - Get XDP-specific hash type from the RX descriptor
* @eop_desc: End of Packet descriptor
*/
static enum xdp_rss_hash_type
ice_xdp_rx_hash_type(const union ice_32b_rx_flex_desc *eop_desc)
{
net: intel: introduce {, Intel} Ethernet common library Not a secret there's a ton of code duplication between two and more Intel ethernet modules. Before introducing new changes, which would need to be copied over again, start decoupling the already existing duplicate functionality into a new module, which will be shared between several Intel Ethernet drivers. Add the lookup table which converts 8/10-bit hardware packet type into a parsed bitfield structure for easy checking packet format parameters, such as payload level, IP version, etc. This is currently used by i40e, ice and iavf and it's all the same in all three drivers. The only difference introduced in this implementation is that instead of defining a 256 (or 1024 in case of ice) element array, add unlikely() condition to limit the input to 154 (current maximum non-reserved packet type). There's no reason to waste 600 (or even 3600) bytes only to not hurt very unlikely exception packets. The hash computation function now takes payload level directly as a pkt_hash_type. There's a couple cases when non-IP ptypes are marked as L3 payload and in the previous versions their hash level would be 2, not 3. But skb_set_hash() only sees difference between L4 and non-L4, thus this won't change anything at all. The module is behind the hidden Kconfig symbol, which the drivers will select when needed. The exports are behind 'LIBIE' namespace to limit the scope of the functions. Not that non-HW-specific symbols will live in yet another module, libeth. This is done to easily distinguish pretty generic code ready for reusing by any other vendor and/or for moving the layer up from the code useful in Intel's 1-100G drivers only. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-04-18 13:36:07 +02:00
return libie_rx_pt_parse(ice_get_ptype(eop_desc)).hash_type;
}
/**
* ice_xdp_rx_hash - RX hash XDP hint handler
* @ctx: XDP buff pointer
* @hash: hash destination address
* @rss_type: XDP hash type destination address
*
* Copy RX hash (if available) and its type to the destination address.
*/
static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
enum xdp_rss_hash_type *rss_type)
{
const struct ice_xdp_buff *xdp_ext = (void *)ctx;
*hash = ice_get_rx_hash(xdp_ext->eop_desc);
*rss_type = ice_xdp_rx_hash_type(xdp_ext->eop_desc);
if (!likely(*hash))
return -ENODATA;
return 0;
}
/**
* ice_xdp_rx_vlan_tag - VLAN tag XDP hint handler
* @ctx: XDP buff pointer
* @vlan_proto: destination address for VLAN protocol
* @vlan_tci: destination address for VLAN TCI
*
* Copy VLAN tag (if was stripped) and corresponding protocol
* to the destination address.
*/
static int ice_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
u16 *vlan_tci)
{
const struct ice_xdp_buff *xdp_ext = (void *)ctx;
*vlan_proto = xdp_ext->pkt_ctx->vlan_proto;
if (!*vlan_proto)
return -ENODATA;
*vlan_tci = ice_get_vlan_tci(xdp_ext->eop_desc);
if (!*vlan_tci)
return -ENODATA;
return 0;
}
const struct xdp_metadata_ops ice_xdp_md_ops = {
.xmo_rx_timestamp = ice_xdp_rx_hw_ts,
.xmo_rx_hash = ice_xdp_rx_hash,
.xmo_rx_vlan_tag = ice_xdp_rx_vlan_tag,
};