// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2019, Intel Corporation. */

#include <linux/filter.h>
#include <linux/net/intel/libie/rx.h>

#include "ice_txrx_lib.h"
#include "ice_eswitch.h"
#include "ice_lib.h"

/**
 * ice_release_rx_desc - Store the new tail and head values
 * @rx_ring: ring to bump
 * @val: new head index
 */
void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val)
{
        u16 prev_ntu = rx_ring->next_to_use & ~0x7;

        rx_ring->next_to_use = val;

        /* update next to alloc since we have filled the ring */
        rx_ring->next_to_alloc = val;

        /* QRX_TAIL will be updated with any tail value, but hardware ignores
         * the lower 3 bits. This makes it so we only bump tail on meaningful
         * boundaries. Also, this allows us to bump tail on intervals of 8 up to
         * the budget depending on the current traffic load.
         */
        val &= ~0x7;
        if (prev_ntu != val) {
                /* Force memory writes to complete before letting h/w
                 * know there are new descriptors to fetch. (Only
                 * applicable for weak-ordered memory model archs,
                 * such as IA-64).
                 */
                wmb();
                writel(val, rx_ring->tail);
        }
}

/**
 * ice_get_rx_hash - get RX hash value from descriptor
 * @rx_desc: specific descriptor
 *
 * Returns hash, if present, 0 otherwise.
 */
static u32 ice_get_rx_hash(const union ice_32b_rx_flex_desc *rx_desc)
{
        const struct ice_32b_rx_flex_desc_nic *nic_mdid;

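        /* the RSS hash is only extracted when the NIC flexible descriptor
         * profile (ICE_RXDID_FLEX_NIC) is in use
         */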
        if (unlikely(rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC))
                return 0;

        nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
        return le32_to_cpu(nic_mdid->rss_hash);
}

/**
 * ice_rx_hash_to_skb - set the hash value in the skb
 * @rx_ring: descriptor ring
 * @rx_desc: specific descriptor
 * @skb: pointer to current skb
 * @rx_ptype: the ptype value from the descriptor
 */
static void
ice_rx_hash_to_skb(const struct ice_rx_ring *rx_ring,
                   const union ice_32b_rx_flex_desc *rx_desc,
                   struct sk_buff *skb, u16 rx_ptype)
{
        struct libeth_rx_pt decoded;
        u32 hash;

        decoded = libie_rx_pt_parse(rx_ptype);
        if (!libeth_rx_pt_has_hash(rx_ring->netdev, decoded))
                return;

        hash = ice_get_rx_hash(rx_desc);
        if (likely(hash))
                libeth_rx_pt_set_hash(skb, hash, decoded);
}

/**
 * ice_rx_csum - Indicate in skb if checksum is good
 * @ring: the ring we care about
 * @skb: skb currently being received and modified
 * @rx_desc: the receive descriptor
 * @ptype: the packet type decoded by hardware
 *
 * skb->protocol must be set before this function is called
 */
static void
ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
            union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
{
        struct libeth_rx_pt decoded;
        u16 rx_status0, rx_status1;
        bool ipv4, ipv6;

        /* Start with CHECKSUM_NONE and by default csum_level = 0 */
        skb->ip_summed = CHECKSUM_NONE;

        decoded = libie_rx_pt_parse(ptype);
        if (!libeth_rx_pt_has_checksum(ring->netdev, decoded))
                return;

        rx_status0 = le16_to_cpu(rx_desc->wb.status_error0);
        rx_status1 = le16_to_cpu(rx_desc->wb.status_error1);

        /* check if HW has decoded the packet and checksum */
        if (!(rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_L3L4P_S)))
                return;

        ipv4 = libeth_rx_pt_get_ip_ver(decoded) == LIBETH_RX_PT_OUTER_IPV4;
        ipv6 = libeth_rx_pt_get_ip_ver(decoded) == LIBETH_RX_PT_OUTER_IPV6;

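        /* an outer (external) IP checksum error is counted separately and the
         * skb is left as CHECKSUM_NONE instead of being reported as a failure
         */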
        if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EIPE_S)))) {
                ring->vsi->back->hw_rx_eipe_error++;
                return;
        }

        if (ipv4 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_IPE_S))))
                goto checksum_fail;

        if (ipv6 && (rx_status0 & (BIT(ICE_RX_FLEX_DESC_STATUS0_IPV6EXADD_S))))
                goto checksum_fail;

        /* check for L4 errors and handle packets that were not able to be
         * checksummed due to arrival speed
         */
        if (rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_L4E_S))
                goto checksum_fail;

        /* check for outer UDP checksum error in tunneled packets */
        if ((rx_status1 & BIT(ICE_RX_FLEX_DESC_STATUS1_NAT_S)) &&
            (rx_status0 & BIT(ICE_RX_FLEX_DESC_STATUS0_XSUM_EUDPE_S)))
                goto checksum_fail;

        /* If there is an outer header present that might contain a checksum
         * we need to bump the checksum level by 1 to reflect the fact that
         * we are indicating we validated the inner checksum.
         */
        if (decoded.tunnel_type >= LIBETH_RX_PT_TUNNEL_IP_GRENAT)
                skb->csum_level = 1;

        skb->ip_summed = CHECKSUM_UNNECESSARY;
        return;

checksum_fail:
        ring->vsi->back->hw_csum_rx_error++;
}

/**
 * ice_ptp_rx_hwts_to_skb - Put RX timestamp into skb
 * @rx_ring: Ring to get the VSI info
 * @rx_desc: Receive descriptor
 * @skb: Particular skb to send timestamp with
 *
 * The timestamp is in ns, so we must convert the result first.
 */
static void
ice_ptp_rx_hwts_to_skb(struct ice_rx_ring *rx_ring,
                       const union ice_32b_rx_flex_desc *rx_desc,
                       struct sk_buff *skb)
{
        u64 ts_ns = ice_ptp_get_rx_hwts(rx_desc, &rx_ring->pkt_ctx);

        skb_hwtstamps(skb)->hwtstamp = ns_to_ktime(ts_ns);
}

/**
 * ice_get_ptype - Read HW packet type from the descriptor
 * @rx_desc: RX descriptor
 */
static u16 ice_get_ptype(const union ice_32b_rx_flex_desc *rx_desc)
{
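        /* mask off the flexible flags so only the packet type bits remain */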
        return le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
               ICE_RX_FLEX_DESC_PTYPE_M;
}

/**
 * ice_process_skb_fields - Populate skb header fields from Rx descriptor
 * @rx_ring: Rx descriptor ring packet is being transacted on
 * @rx_desc: pointer to the EOP Rx descriptor
 * @skb: pointer to current skb being populated
 *
 * This function checks the ring, descriptor, and packet information in
 * order to populate the hash, checksum, VLAN, protocol, and
 * other fields within the skb.
 */
void
ice_process_skb_fields(struct ice_rx_ring *rx_ring,
                       union ice_32b_rx_flex_desc *rx_desc,
                       struct sk_buff *skb)
{
        u16 ptype = ice_get_ptype(rx_desc);

        ice_rx_hash_to_skb(rx_ring, rx_desc, skb, ptype);

        /* modifies the skb - consumes the enet header */
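        /* the ring may carry traffic for more than one netdev (eswitch port
         * representors); resolve the target netdev from the descriptor first
         */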
        if (unlikely(rx_ring->flags & ICE_RX_FLAGS_MULTIDEV)) {
                struct net_device *netdev = ice_eswitch_get_target(rx_ring,
                                                                   rx_desc);

                if (ice_is_port_repr_netdev(netdev))
                        ice_repr_inc_rx_stats(netdev, skb->len);
                skb->protocol = eth_type_trans(skb, netdev);
        } else {
                skb->protocol = eth_type_trans(skb, rx_ring->netdev);
        }

        ice_rx_csum(rx_ring, skb, rx_desc, ptype);

        if (rx_ring->ptp_rx)
                ice_ptp_rx_hwts_to_skb(rx_ring, rx_desc, skb);
}

/**
 * ice_receive_skb - Send a completed packet up the stack
 * @rx_ring: Rx ring in play
 * @skb: packet to send up
 * @vlan_tci: VLAN TCI for packet
 *
 * This function sends the completed packet (via skb) up the stack using
 * GRO receive functions (with/without VLAN tag)
 */
void
ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tci)
{
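        /* only forward a stripped VLAN tag when the ring has a stripping
         * ethertype (802.1Q or 802.1ad) configured
         */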
        if ((vlan_tci & VLAN_VID_MASK) && rx_ring->vlan_proto)
                __vlan_hwaccel_put_tag(skb, rx_ring->vlan_proto,
                                       vlan_tci);

        napi_gro_receive(&rx_ring->q_vector->napi, skb);
}

/**
 * ice_clean_xdp_tx_buf - Free and unmap XDP Tx buffer
 * @dev: device for DMA mapping
 * @tx_buf: Tx buffer to clean
 * @bq: XDP bulk flush struct
 */
static void
ice_clean_xdp_tx_buf(struct device *dev, struct ice_tx_buf *tx_buf,
                     struct xdp_frame_bulk *bq)
{
        dma_unmap_single(dev, dma_unmap_addr(tx_buf, dma),
                         dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
        dma_unmap_len_set(tx_buf, len, 0);

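        /* the buffer type decides how the frame memory must be returned:
         * XDP_TX frames are page frags from the driver's Rx pages, while
         * redirected (ndo_xdp_xmit) frames go back through the xdp_frame API
         */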
        switch (tx_buf->type) {
        case ICE_TX_BUF_XDP_TX:
                page_frag_free(tx_buf->raw_buf);
                break;
        case ICE_TX_BUF_XDP_XMIT:
                xdp_return_frame_bulk(tx_buf->xdpf, bq);
                break;
        }

        tx_buf->type = ICE_TX_BUF_EMPTY;
}

/**
 * ice_clean_xdp_irq - Reclaim resources after transmit completes on XDP ring
 * @xdp_ring: XDP ring to clean
 */
static u32 ice_clean_xdp_irq(struct ice_tx_ring *xdp_ring)
{
        int total_bytes = 0, total_pkts = 0;
        struct device *dev = xdp_ring->dev;
        u32 ntc = xdp_ring->next_to_clean;
        struct ice_tx_desc *tx_desc;
        u32 cnt = xdp_ring->count;
        struct xdp_frame_bulk bq;
        u32 frags, xdp_tx = 0;
        u32 ready_frames = 0;
        u32 idx;
        u32 ret;

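        /* HW sets DD only on descriptors that had RS set; if the last RS
         * descriptor is done, everything up to and including it (possibly
         * wrapping past the end of the ring) is ready to be cleaned
         */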
        idx = xdp_ring->tx_buf[ntc].rs_idx;
        tx_desc = ICE_TX_DESC(xdp_ring, idx);
        if (tx_desc->cmd_type_offset_bsz &
            cpu_to_le64(ICE_TX_DESC_DTYPE_DESC_DONE)) {
                if (idx >= ntc)
                        ready_frames = idx - ntc + 1;
                else
                        ready_frames = idx + cnt - ntc + 1;
        }

        if (unlikely(!ready_frames))
                return 0;
        ret = ready_frames;

        xdp_frame_bulk_init(&bq);
        rcu_read_lock(); /* xdp_return_frame_bulk() */

        while (ready_frames) {
                struct ice_tx_buf *tx_buf = &xdp_ring->tx_buf[ntc];
ice: Robustify cleaning/completing XDP Tx buffers
When queueing frames from a Page Pool for redirecting to a device backed
by the ice driver, `perf top` shows heavy load on page_alloc() and
page_frag_free(), even though on a properly working system it should be
fully, or at least almost, zero-alloc. The problem is in fact a bit deeper
and arises from how ice cleans up completed Tx buffers.
The story so far: when cleaning/freeing the resources related to
a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some
heuristics only without setting any type explicitly (except for dummy
Flow Director packets, which are marked via ice_tx_buf::tx_flags).
This kind of works, but only up to a point. For example, ice currently
assumes that each frame coming to __ice_xmit_xdp_ring() is backed by
either a plain order-0 page or a plain page frag, while it may also be
backed by Page Pool or any other memory model introduced in the future.
This means any &xdp_frame must be freed properly via the
xdp_return_frame() family, with no assumptions.
In order to do that, the heuristics must be replaced with setting the Tx
buffer/frame type explicitly, just like it's always been done via an enum.
Let us reuse 16 bits from ::tx_flags -- one bit-and instruction won't
hurt much -- especially given that sometimes there was a check for
%ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum
member. The rest of the changes are straightforward and most of them are
just a conversion to rely on the type set in &ice_tx_buf rather than on
some secondary properties.
For now, no functional changes are intended; the change only prepares the
ground for freeing XDP frames properly in the next step. And it must
be done atomically/synchronously to not break stuff.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-10 18:06:16 +01:00
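A minimal sketch of the type-driven freeing this prepares for (hedged: the
toy_* names are illustrative; page_frag_free() and xdp_return_frame_bulk()
are the kernel APIs the driver's cleaning helper ends up using for the
respective types):

enum toy_tx_buf_type {
        TOY_TX_BUF_EMPTY,       /* nothing attached, nothing to free */
        TOY_TX_BUF_FRAG,        /* DMA-mapped fragment, unmap only */
        TOY_TX_BUF_XDP_TX,      /* driver-owned page frag from XDP_TX */
        TOY_TX_BUF_XDP_XMIT,    /* &xdp_frame queued via .ndo_xdp_xmit() */
};

struct toy_tx_buf {
        enum toy_tx_buf_type type;
        void *raw_buf;                  /* TOY_TX_BUF_XDP_TX payload */
        struct xdp_frame *xdpf;         /* TOY_TX_BUF_XDP_XMIT frame */
};

static void toy_free_tx_buf(struct toy_tx_buf *buf, struct xdp_frame_bulk *bq)
{
        switch (buf->type) {
        case TOY_TX_BUF_XDP_TX:
                page_frag_free(buf->raw_buf);           /* plain page frag */
                break;
        case TOY_TX_BUF_XDP_XMIT:
                xdp_return_frame_bulk(buf->xdpf, bq);   /* honours the frame's memory model */
                break;
        default:
                break;
        }
        buf->type = TOY_TX_BUF_EMPTY;
}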
|
|
|
struct ice_tx_buf *head = tx_buf;
|
2021-08-19 14:00:02 +02:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
/* bytecount holds size of head + frags */
|
2021-08-19 14:00:02 +02:00
|
|
|
total_bytes += tx_buf->bytecount;
|
2023-01-31 21:45:04 +01:00
|
|
|
frags = tx_buf->nr_frags;
|
2021-08-19 14:00:02 +02:00
|
|
|
total_pkts++;
|
2023-01-31 21:45:04 +01:00
|
|
|
/* count head + frags */
|
|
|
|
ready_frames -= frags + 1;
|
2023-02-10 18:06:13 +01:00
|
|
|
xdp_tx++;
|
2021-08-19 14:00:02 +02:00
|
|
|
|
|
|
|
ntc++;
|
2023-01-31 21:45:04 +01:00
|
|
|
if (ntc == cnt)
|
2021-08-19 14:00:02 +02:00
|
|
|
ntc = 0;
|
2023-01-31 21:45:04 +01:00
|
|
|
|
|
|
|
for (int i = 0; i < frags; i++) {
|
|
|
|
tx_buf = &xdp_ring->tx_buf[ntc];
|
|
|
|
|
2023-02-10 18:06:17 +01:00
|
|
|
ice_clean_xdp_tx_buf(dev, tx_buf, &bq);
|
2023-01-31 21:45:04 +01:00
|
|
|
ntc++;
|
|
|
|
if (ntc == cnt)
|
|
|
|
ntc = 0;
|
|
|
|
}
|
2023-02-10 18:06:16 +01:00
|
|
|
|
2023-02-10 18:06:17 +01:00
|
|
|
ice_clean_xdp_tx_buf(dev, head, &bq);
|
2021-08-19 14:00:02 +02:00
|
|
|
}
|
|
|
|
|
2023-02-10 18:06:17 +01:00
|
|
|
xdp_flush_frame_bulk(&bq);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
tx_desc->cmd_type_offset_bsz = 0;
|
2021-08-19 14:00:02 +02:00
|
|
|
xdp_ring->next_to_clean = ntc;
|
2023-02-10 18:06:13 +01:00
|
|
|
xdp_ring->xdp_tx_active -= xdp_tx;
|
2021-08-19 14:00:02 +02:00
|
|
|
ice_update_tx_ring_stats(xdp_ring, total_pkts, total_bytes);
|
2023-01-31 21:45:04 +01:00
|
|
|
|
|
|
|
return ret;
|
2021-08-19 14:00:02 +02:00
|
|
|
}
|
|
|
|
|
2019-11-04 09:38:56 -08:00
|
|
|
/**
|
2023-01-31 21:45:04 +01:00
|
|
|
* __ice_xmit_xdp_ring - submit frame to XDP ring for transmission
|
|
|
|
* @xdp: XDP buffer to be placed onto Tx descriptors
|
2019-11-04 09:38:56 -08:00
|
|
|
* @xdp_ring: XDP ring for transmission
|
2023-02-10 18:06:17 +01:00
|
|
|
* @frame: whether this comes from .ndo_xdp_xmit()
|
2019-11-04 09:38:56 -08:00
|
|
|
*/
|
2023-02-10 18:06:17 +01:00
|
|
|
int __ice_xmit_xdp_ring(struct xdp_buff *xdp, struct ice_tx_ring *xdp_ring,
|
|
|
|
bool frame)
|
2019-11-04 09:38:56 -08:00
|
|
|
{
|
2023-01-31 21:45:04 +01:00
|
|
|
struct skb_shared_info *sinfo = NULL;
|
|
|
|
u32 size = xdp->data_end - xdp->data;
|
|
|
|
struct device *dev = xdp_ring->dev;
|
|
|
|
u32 ntu = xdp_ring->next_to_use;
|
2019-11-04 09:38:56 -08:00
|
|
|
struct ice_tx_desc *tx_desc;
|
2023-01-31 21:45:04 +01:00
|
|
|
struct ice_tx_buf *tx_head;
|
2019-11-04 09:38:56 -08:00
|
|
|
struct ice_tx_buf *tx_buf;
|
2023-01-31 21:45:04 +01:00
|
|
|
u32 cnt = xdp_ring->count;
|
|
|
|
void *data = xdp->data;
|
|
|
|
u32 nr_frags = 0;
|
|
|
|
u32 free_space;
|
|
|
|
u32 frag = 0;
|
|
|
|
|
|
|
|
free_space = ICE_DESC_UNUSED(xdp_ring);
|
ice: Fix XDP Tx ring overrun
Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic
generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can run into OOM due
to the driver not freeing all of the pages passed to it by
.ndo_xdp_xmit().
It turned out that during the development of the tagged commit, the
check which ensures that we have a free descriptor to queue a frame moved
into the branch taken only when a buffer has frags. Otherwise, we only
run a cleaning cycle but don't check anything.
At the same time, there can be situations when the driver gets new frames
to send, but there are no buffers that can be cleaned/completed and the
ring has no free slots. It's very rare, but still possible (> 6.5 Mpps
per ring).
The driver then fills the next buffer/descriptor, effectively
overwriting the data, which still needs to be freed.
Restore the check after the cleaning routine to make sure there is a
slot to queue a new frame. When there are frags, there will still be a
separate check that we can place all of them, but if the ring is full,
there's no point in wasting any more time.
(minor: make `!ready_frames` unlikely since it happens ~1-2 times per
billion frames)
Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
2023-02-10 18:06:14 +01:00
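For reference, a sketch of the free-slot arithmetic that a check like the
one below relies on (illustrative; the driver uses its own
ICE_DESC_UNUSED() helper): one slot is always left unused so that
next_to_clean == next_to_use unambiguously means the ring is empty.

static inline unsigned int toy_ring_unused(unsigned int ntc, unsigned int ntu,
                                           unsigned int count)
{
        /* descriptors the producer may still fill before catching up
         * with the consumer (one slot is deliberately kept unused)
         */
        return ((ntc > ntu) ? 0 : count) + ntc - ntu - 1;
}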
|
|
|
if (free_space < ICE_RING_QUARTER(xdp_ring))
|
2023-01-31 21:45:04 +01:00
|
|
|
free_space += ice_clean_xdp_irq(xdp_ring);
|
|
|
|
|
2023-02-10 18:06:14 +01:00
|
|
|
if (unlikely(!free_space))
|
|
|
|
goto busy;
|
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
if (unlikely(xdp_buff_has_frags(xdp))) {
|
|
|
|
sinfo = xdp_get_shared_info_from_buff(xdp);
|
|
|
|
nr_frags = sinfo->nr_frags;
|
2023-02-10 18:06:14 +01:00
|
|
|
if (free_space < nr_frags + 1)
|
|
|
|
goto busy;
|
2023-01-31 21:45:04 +01:00
|
|
|
}
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
tx_desc = ICE_TX_DESC(xdp_ring, ntu);
|
|
|
|
tx_head = &xdp_ring->tx_buf[ntu];
|
|
|
|
tx_buf = tx_head;
|
2021-08-19 14:00:02 +02:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
for (;;) {
|
|
|
|
dma_addr_t dma;
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
dma = dma_map_single(dev, data, size, DMA_TO_DEVICE);
|
|
|
|
if (dma_mapping_error(dev, dma))
|
|
|
|
goto dma_unmap;
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
/* record length, and DMA address */
|
|
|
|
dma_unmap_len_set(tx_buf, len, size);
|
|
|
|
dma_unmap_addr_set(tx_buf, dma, dma);
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-02-10 18:06:17 +01:00
|
|
|
if (frame) {
|
|
|
|
tx_buf->type = ICE_TX_BUF_FRAG;
|
|
|
|
} else {
|
|
|
|
tx_buf->type = ICE_TX_BUF_XDP_TX;
|
|
|
|
tx_buf->raw_buf = data;
|
|
|
|
}
|
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
tx_desc->buf_addr = cpu_to_le64(dma);
|
|
|
|
tx_desc->cmd_type_offset_bsz = ice_build_ctob(0, 0, size, 0);
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
ntu++;
|
|
|
|
if (ntu == cnt)
|
|
|
|
ntu = 0;
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
if (frag == nr_frags)
|
|
|
|
break;
|
|
|
|
|
|
|
|
tx_desc = ICE_TX_DESC(xdp_ring, ntu);
|
|
|
|
tx_buf = &xdp_ring->tx_buf[ntu];
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
data = skb_frag_address(&sinfo->frags[frag]);
|
|
|
|
size = skb_frag_size(&sinfo->frags[frag]);
|
|
|
|
frag++;
|
2021-08-19 14:00:02 +02:00
|
|
|
}
|
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
/* store info about bytecount and frag count in first desc */
|
|
|
|
tx_head->bytecount = xdp_get_buff_len(xdp);
|
|
|
|
tx_head->nr_frags = nr_frags;
|
|
|
|
|
2023-02-10 18:06:17 +01:00
|
|
|
if (frame) {
|
|
|
|
tx_head->type = ICE_TX_BUF_XDP_XMIT;
|
|
|
|
tx_head->xdpf = xdp->data_hard_start;
|
|
|
|
}
|
|
|
|
|
2023-01-31 21:45:04 +01:00
|
|
|
/* update last descriptor from a frame with EOP */
|
|
|
|
tx_desc->cmd_type_offset_bsz |=
|
|
|
|
cpu_to_le64(ICE_TX_DESC_CMD_EOP << ICE_TXD_QW1_CMD_S);
|
|
|
|
|
|
|
|
xdp_ring->xdp_tx_active++;
|
|
|
|
xdp_ring->next_to_use = ntu;
|
|
|
|
|
2019-11-04 09:38:56 -08:00
|
|
|
return ICE_XDP_TX;
|
2023-01-31 21:45:04 +01:00
|
|
|
|
|
|
|
dma_unmap:
|
|
|
|
for (;;) {
|
|
|
|
tx_buf = &xdp_ring->tx_buf[ntu];
|
|
|
|
dma_unmap_page(dev, dma_unmap_addr(tx_buf, dma),
|
|
|
|
dma_unmap_len(tx_buf, len), DMA_TO_DEVICE);
|
|
|
|
dma_unmap_len_set(tx_buf, len, 0);
|
|
|
|
if (tx_buf == tx_head)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (!ntu)
|
|
|
|
ntu += cnt;
|
|
|
|
ntu--;
|
|
|
|
}
|
|
|
|
return ICE_XDP_CONSUMED;
|
2023-02-10 18:06:14 +01:00
|
|
|
|
|
|
|
busy:
|
|
|
|
xdp_ring->ring_stats->tx_stats.tx_busy++;
|
|
|
|
|
|
|
|
return ICE_XDP_CONSUMED;
|
2023-01-31 21:45:04 +01:00
|
|
|
}
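For context, a hedged sketch of how the .ndo_xdp_xmit() path can feed an
&xdp_frame into __ice_xmit_xdp_ring() by rebuilding an &xdp_buff around it;
the driver keeps a small inline helper along these lines in its header, but
the exact field handling here is a from-memory approximation. Note how
data_hard_start carries the frame pointer that the frame == true path above
stores back into tx_head->xdpf:

static int toy_xmit_xdp_frame(struct xdp_frame *xdpf,
                              struct ice_tx_ring *xdp_ring)
{
        struct xdp_buff xdp;

        xdp.data_hard_start = (void *)xdpf;     /* recovered via tx_head->xdpf */
        xdp.data = xdpf->data;
        xdp.data_end = xdpf->data + xdpf->len;
        xdp.frame_sz = xdpf->frame_sz;
        xdp.flags = xdpf->flags;                /* carries the frags flag */

        return __ice_xmit_xdp_ring(&xdp, xdp_ring, true);
}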
|
|
|
|
|
2019-11-04 09:38:56 -08:00
|
|
|
/**
|
|
|
|
* ice_finalize_xdp_rx - Bump XDP Tx tail and/or flush redirect map
|
2021-08-19 14:00:01 +02:00
|
|
|
* @xdp_ring: XDP ring
|
2019-11-04 09:38:56 -08:00
|
|
|
* @xdp_res: Result of the receive batch
|
2023-03-13 13:36:07 -07:00
|
|
|
* @first_idx: index to write from caller
|
2019-11-04 09:38:56 -08:00
|
|
|
*
|
|
|
|
* This function bumps XDP Tx tail and/or flush redirect map, and
|
|
|
|
* should be called when a batch of packets has been processed in the
|
|
|
|
* napi loop.
|
|
|
|
*/
|
2023-01-31 21:45:04 +01:00
|
|
|
void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res,
|
|
|
|
u32 first_idx)
|
2019-11-04 09:38:56 -08:00
|
|
|
{
|
2023-01-31 21:45:04 +01:00
|
|
|
struct ice_tx_buf *tx_buf = &xdp_ring->tx_buf[first_idx];
|
|
|
|
|
2019-11-04 09:38:56 -08:00
|
|
|
if (xdp_res & ICE_XDP_REDIR)
|
2023-09-08 16:32:14 +02:00
|
|
|
xdp_do_flush();
|
2019-11-04 09:38:56 -08:00
|
|
|
|
2021-08-19 14:00:03 +02:00
|
|
|
if (xdp_res & ICE_XDP_TX) {
|
|
|
|
if (static_branch_unlikely(&ice_xdp_locking_key))
|
|
|
|
spin_lock(&xdp_ring->tx_lock);
|
2023-01-31 21:45:04 +01:00
|
|
|
/* store index of descriptor with RS bit set in the first
|
|
|
|
* ice_tx_buf of given NAPI batch
|
|
|
|
*/
|
|
|
|
tx_buf->rs_idx = ice_set_rs_bit(xdp_ring);
|
2019-11-04 09:38:56 -08:00
|
|
|
ice_xdp_ring_update_tail(xdp_ring);
|
2021-08-19 14:00:03 +02:00
|
|
|
if (static_branch_unlikely(&ice_xdp_locking_key))
|
|
|
|
spin_unlock(&xdp_ring->tx_lock);
|
|
|
|
}
|
2019-11-04 09:38:56 -08:00
|
|
|
}
|
2023-12-05 22:08:34 +01:00
|
|
|
|
|
|
|
/**
|
|
|
|
* ice_xdp_rx_hw_ts - HW timestamp XDP hint handler
|
|
|
|
* @ctx: XDP buff pointer
|
|
|
|
* @ts_ns: destination address
|
|
|
|
*
|
|
|
|
* Copy HW timestamp (if available) to the destination address.
|
|
|
|
*/
|
|
|
|
static int ice_xdp_rx_hw_ts(const struct xdp_md *ctx, u64 *ts_ns)
|
|
|
|
{
|
|
|
|
const struct ice_xdp_buff *xdp_ext = (void *)ctx;
|
|
|
|
|
|
|
|
*ts_ns = ice_ptp_get_rx_hwts(xdp_ext->eop_desc,
|
|
|
|
xdp_ext->pkt_ctx);
|
|
|
|
if (!*ts_ns)
|
|
|
|
return -ENODATA;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-12-05 22:08:35 +01:00
|
|
|
/**
|
|
|
|
* ice_xdp_rx_hash_type - Get XDP-specific hash type from the RX descriptor
|
|
|
|
* @eop_desc: End of Packet descriptor
|
|
|
|
*/
|
|
|
|
static enum xdp_rss_hash_type
|
|
|
|
ice_xdp_rx_hash_type(const union ice_32b_rx_flex_desc *eop_desc)
|
|
|
|
{
|
2024-04-18 13:36:07 +02:00
|
|
|
return libie_rx_pt_parse(ice_get_ptype(eop_desc)).hash_type;
|
2023-12-05 22:08:35 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* ice_xdp_rx_hash - RX hash XDP hint handler
|
|
|
|
* @ctx: XDP buff pointer
|
|
|
|
* @hash: hash destination address
|
|
|
|
* @rss_type: XDP hash type destination address
|
|
|
|
*
|
|
|
|
* Copy RX hash (if available) and its type to the destination address.
|
|
|
|
*/
|
|
|
|
static int ice_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
|
|
|
|
enum xdp_rss_hash_type *rss_type)
|
|
|
|
{
|
|
|
|
const struct ice_xdp_buff *xdp_ext = (void *)ctx;
|
|
|
|
|
|
|
|
*hash = ice_get_rx_hash(xdp_ext->eop_desc);
|
|
|
|
*rss_type = ice_xdp_rx_hash_type(xdp_ext->eop_desc);
|
|
|
|
if (!likely(*hash))
|
|
|
|
return -ENODATA;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-12-05 22:08:39 +01:00
|
|
|
/**
|
|
|
|
* ice_xdp_rx_vlan_tag - VLAN tag XDP hint handler
|
|
|
|
* @ctx: XDP buff pointer
|
|
|
|
* @vlan_proto: destination address for VLAN protocol
|
|
|
|
* @vlan_tci: destination address for VLAN TCI
|
|
|
|
*
|
|
|
|
* Copy VLAN tag (if was stripped) and corresponding protocol
|
|
|
|
* to the destination address.
|
|
|
|
*/
|
|
|
|
static int ice_xdp_rx_vlan_tag(const struct xdp_md *ctx, __be16 *vlan_proto,
|
|
|
|
u16 *vlan_tci)
|
|
|
|
{
|
|
|
|
const struct ice_xdp_buff *xdp_ext = (void *)ctx;
|
|
|
|
|
|
|
|
*vlan_proto = xdp_ext->pkt_ctx->vlan_proto;
|
|
|
|
if (!*vlan_proto)
|
|
|
|
return -ENODATA;
|
|
|
|
|
|
|
|
*vlan_tci = ice_get_vlan_tci(xdp_ext->eop_desc);
|
|
|
|
if (!*vlan_tci)
|
|
|
|
return -ENODATA;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-12-05 22:08:34 +01:00
|
|
|
const struct xdp_metadata_ops ice_xdp_md_ops = {
|
|
|
|
.xmo_rx_timestamp = ice_xdp_rx_hw_ts,
|
2023-12-05 22:08:35 +01:00
|
|
|
.xmo_rx_hash = ice_xdp_rx_hash,
|
2023-12-05 22:08:39 +01:00
|
|
|
.xmo_rx_vlan_tag = ice_xdp_rx_vlan_tag,
|
2023-12-05 22:08:34 +01:00
|
|
|
};
|
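For completeness, a hedged BPF-side sketch of how the xmo_* callbacks
registered above get exercised: an XDP program calls the RX metadata
kfuncs and the kernel dispatches them to the handlers in ice_xdp_md_ops.
Kfunc prototypes follow the kernel's XDP RX metadata documentation; verify
them against the vmlinux.h generated for the running kernel:

/* SPDX-License-Identifier: GPL-2.0 */
#include "vmlinux.h"            /* xdp_md, xdp_rss_hash_type, XDP_PASS */
#include <bpf/bpf_helpers.h>

extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                         __u64 *timestamp) __ksym;
extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
                                    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int rx_hints(struct xdp_md *ctx)
{
        enum xdp_rss_hash_type rss_type;
        __u64 ts;
        __u32 hash;

        /* each kfunc returns 0 on success and -ENODATA when the hint
         * is not available for this frame
         */
        if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
                bpf_printk("rx hash 0x%x type %d", hash, rss_type);
        if (!bpf_xdp_metadata_rx_timestamp(ctx, &ts))
                bpf_printk("hw rx ts %llu", ts);

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";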