In bnxt_alloc_ctx_mem(), the logic to set up the context memory entries
and to allocate the context memory tables is done repetitively. Add
a helper function to simplify the code.
The setup of the Fast Path TQM entries relies on some information from
the Slow Path TQM entries. Copy the SP_TQM entries to the FP_TQM
entries to simplify the logic.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-7-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the newly added pg_info field in bnxt_ctx_mem_type struct and
remove the standalone page info structures in bnxt_ctx_mem_info.
This now completes the reorganization of the context memory
structures to work better with the new and more flexible firmware
interface for newer chips.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This will further improve the organization of the bnxt_ctx_mem_info
structure by moving the standalone page info structures into the
bnxt_ctx_mem_type array. Add the allocation and free logic first and
the next patch will migrate to use the new infrastructure.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current code uses a flat bnxt_ctx_mem_info structure to store 8
types of context memory for the NIC. All the context memory types
are very similar and have similar parameters. They can all share a
common structure to improve the organization. Also, the new firmware
interface will provide a new API to retrieve each type of context
memory by calling the API repeatedly.
This patch reorganizes the bnxt_ctx_mem_info structure to fit better
with the new firmware interface. It will also work with the legacy
firmware interface. The flat fields in bnxt_ctx_mem_info are replaced
by the bnxt_ctx_mem_type array. The bnxt_mem_init array info will no
longer be needed.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
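A rough sketch of the reorganization described in the commit above: once
each context memory type lives in one array, a single loop can handle
every type. All identifiers and fields here (sketch_ctx_mem_type,
entry_size, max_entries, and so on) are hypothetical and do not reproduce
the driver's actual structures.

```c
#include <stddef.h>
#include <stdint.h>

/* The commit message counts 8 context memory types. */
#define SKETCH_CTX_TYPES 8

/* Hypothetical per-type descriptor; the field names are assumptions. */
struct sketch_ctx_mem_type {
	uint16_t entry_size;	/* bytes per entry */
	uint32_t max_entries;	/* entries to back with host memory */
};

/* One array replaces the flat per-type fields. */
struct sketch_ctx_mem_info {
	struct sketch_ctx_mem_type ctx_arr[SKETCH_CTX_TYPES];
};

/* Sum the backing store size over all types with one loop. */
static inline uint64_t
sketch_ctx_total_bytes(const struct sketch_ctx_mem_info *ctx)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < SKETCH_CTX_TYPES; i++)
		total += (uint64_t)ctx->ctx_arr[i].entry_size *
			 ctx->ctx_arr[i].max_entries;
	return total;
}
```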
We always free bp->ctx right after calling bnxt_free_ctx_mem(), so just
free it at the end of that function to make things simpler.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
bnxt_alloc_ctx_mem() calls bnxt_hwrm_func_backing_store_qcaps() to
allocate the memory for bp->ctx. Initialize bp->ctx with the allocated
memory and let the caller free it during unwind. The unwind logic is
already there; we just need to always set bp->ctx to the allocated
memory so the caller will always free it. This simplifies the logic
and makes it easier to expand on the backing store logic.
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231120234405.194542-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we use the cumulative consumer index scheme for TX completion,
we don't need to have one TX completion per TX packet in the xmit_more
code path. Set the TX_BD_FLAGS_NO_CMPL flag if xmit_more is true.
Fall back to one interrupt per packet if the ring is filled beyond
bp->tx_wake_thresh.
Also, move the wmb() to bnxt_txr_db_kick(). When xmit_more is true,
we'll skip the bnxt_txr_db_kick() call and there is no need to call
wmb() to sync the TX BD data.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
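The xmit_more decision in the commit above might be sketched as follows.
This is an illustration only: the flag value is a placeholder, the helper
name is hypothetical, and reading "filled beyond bp->tx_wake_thresh" as
"fewer free slots than the threshold" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit for illustration; the real TX_BD_FLAGS_NO_CMPL
 * definition lives in the driver headers.
 */
#define SKETCH_TX_BD_FLAGS_NO_CMPL	(1u << 0)

/* Return the completion-related flag bits for one TX packet.
 * Completion is suppressed only while more packets are pending and the
 * ring still has at least wake_thresh free slots; otherwise a normal
 * completion is requested.
 */
static inline uint32_t sketch_tx_cmpl_flags(bool xmit_more,
					    uint32_t free_slots,
					    uint32_t wake_thresh)
{
	if (xmit_more && free_slots >= wake_thresh)
		return SKETCH_TX_BD_FLAGS_NO_CMPL;
	return 0;
}
```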
We can now fully support sharing the same MSIX for all mqprio TX rings
belonging to the same ethtool channel with the new infrastructure:
1. Allocate the proper entries for cp_ring_arr in struct bnxt_cp_ring_info
to support the additional TX rings.
2. Populate the tx_ring array in struct bnxt_napi for all TX rings
sharing the same NAPI.
3. bnxt_num_tx_to_cp() returns the proper NQ/completion rings to support
the TX rings in the input.
4. Adjust bnxt_get_num_ring_stats() for the reduced number of ring
counters with the new scheme.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add 3 macros that handle the conversions between TC numbers and TX
ring numbers. These will help to clarify the existing logic and the
new logic in the next patch.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
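For illustration, TC/TX ring conversions of the kind the commit above
describes could look like the sketch below. The macro names are
hypothetical, and the layout is an assumption: each TC is taken to own a
contiguous block of rings_per_tc TX rings.

```c
/* Assumed layout: TC n owns TX rings [n * rings_per_tc, (n + 1) * rings_per_tc). */

/* First TX ring number belonging to a TC. */
#define SKETCH_TC_TO_RING_BASE(rings_per_tc, tc)	((tc) * (rings_per_tc))

/* TC that a TX ring number belongs to. */
#define SKETCH_RING_TO_TC(rings_per_tc, ring)		((ring) / (rings_per_tc))

/* Offset of a TX ring within its TC. */
#define SKETCH_RING_TO_TC_OFF(rings_per_tc, ring)	((ring) % (rings_per_tc))
```

With 4 rings per TC, TX ring 9 would map to TC 2 at offset 1.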
Up until now, each TX ring always requires a completion ring/NQ/MSIX.
bnxt_trim_rings() and the assignment of bp->cp_nr_rings always make
this assumption. This will no longer be true in the next patches, so
we refactor and add helper functions to determine the proper relationship
between TX rings and the required completion ring/NQ/MSIX. This patch
does not change the 1:1 relationship yet.
Note that on P5 chips, each RX and TX ring still requires a completion
ring. Only the number of NQs has been reduced. We should no longer call
bnxt_trim_rings() to adjust the RX and TX rings on P5 chips. Replace with
simple logic to check that RX + TX < CP and adjust accordingly.
bnxt_check_rings() should call _bnxt_get_max_rings() to get the raw
number of rings instead of bnxt_get_max_rings(). If we are about to
create TCs, bnxt_get_max_rings() would not be able to calculate the max
rings correctly.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For each mqprio TC, we allocate a set of TX rings to map to the new
hardware CoS queue. Expand the tx_ring pointer in struct bnxt_napi
to an array of 8 to support up to 8 TX rings, one for each TC.
Only array entry 0 is used at this time. The rest of the array
entries will be used in later patches.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add 2 helper functions to set coalescing for each RX and TX ring. This
will make it easier to expand the number of TX rings per MSIX in the
next patches.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support multiple TX rings on the same MSIX, we'll use the
upper byte of the TX opaque field to store the ring index in the new
tx_napi_idx field. This tx_napi_idx field is currently always 0 until
more infrastructure is added in later patches.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
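A minimal sketch of storing a ring index in the upper byte of a 32-bit
opaque value, as the commit above describes. The helper names and the
shift constant are hypothetical; only the "upper byte" placement comes
from the commit message.

```c
#include <stdint.h>

#define SKETCH_OPAQUE_RING_SHIFT	24
#define SKETCH_OPAQUE_LOW_MASK		0x00ffffffu

/* Place tx_napi_idx in the upper byte, keeping the lower 24 bits. */
static inline uint32_t sketch_opaque_set_ring(uint32_t opaque,
					      uint8_t tx_napi_idx)
{
	return (opaque & SKETCH_OPAQUE_LOW_MASK) |
	       ((uint32_t)tx_napi_idx << SKETCH_OPAQUE_RING_SHIFT);
}

/* Recover the ring index from a completed opaque value. */
static inline uint8_t sketch_opaque_get_ring(uint32_t opaque)
{
	return (uint8_t)(opaque >> SKETCH_OPAQUE_RING_SHIFT);
}
```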
bnxt_tx_int() processes only the one TX ring from the bnxt_napi pointer.
To prepare for more TX rings associated with the bnxt_napi structure,
add a new __bnxt_tx_int() function that takes the bnxt_tx_ring_info
pointer to process that one TX ring. No functional change.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These 2 constants were used for the one RX and one TX completion
ring pointer in the cpr->cp_ring_arr fixed array. Now that we've
changed to allocating the array for the exact number of entries to
support more TX rings, we no longer use these constants.
The array index as well as the type of completion ring (RX/TX) are
now encoded in the handle for the completion ring. This will allow
us to locate the completion ring during NAPI for any number of
completion rings sharing the same MSIX. In the following patches,
we'll be adding support for more TX rings associated with the same
MSIX vector.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
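One way to encode both the array index and the ring type in a completion
ring handle, as the commit above describes, is sketched here. The bit
layout, enum, and helper names are assumptions made for illustration.

```c
#include <stdint.h>

enum sketch_cp_ring_type {
	SKETCH_CP_RING_RX = 0,
	SKETCH_CP_RING_TX = 1,
};

/* Assumed layout: ring type in the top byte, array index below it. */
#define SKETCH_HDL_TYPE_SHIFT	24
#define SKETCH_HDL_IDX_MASK	0x00ffffffu

static inline uint32_t sketch_make_handle(enum sketch_cp_ring_type type,
					  uint32_t arr_idx)
{
	return ((uint32_t)type << SKETCH_HDL_TYPE_SHIFT) |
	       (arr_idx & SKETCH_HDL_IDX_MASK);
}

/* Index into the completion ring array for this handle. */
static inline uint32_t sketch_handle_idx(uint32_t hdl)
{
	return hdl & SKETCH_HDL_IDX_MASK;
}

/* Whether the handle refers to an RX or a TX completion ring. */
static inline enum sketch_cp_ring_type sketch_handle_type(uint32_t hdl)
{
	return (enum sketch_cp_ring_type)(hdl >> SKETCH_HDL_TYPE_SHIFT);
}
```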
From the TX or RX ring structure, we need to find the corresponding
completion ring during initialization. On P5 chips, we use the MSIX/napi
entry to locate the completion ring because there is only one RX/TX
ring per MSIX. To allow multiple TX rings for each MSIX, we need
to add a direct pointer from the TX ring and RX ring structures.
This also simplifies the existing logic.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cp_ring_arr is currently a fixed array of 2 pointers for the
TX and RX completion rings. These pointers are allocated during
ring initialization. Currently, we support up to 2 completion rings
for each MSIX. In order to support more completion rings, we change
this fixed array to a pointer and allocate the required entries
during ring initialization. This patch keeps the current scheme of
allocating only 2 entries when needed. Later patches will expand
and allocate more entries when required.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
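The change from a fixed array to an allocated one, as described in the
commit above, can be sketched in userspace C as follows. calloc() and
free() stand in for the kernel allocators, and all structure and function
names are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

struct sketch_cp_ring {
	int ring_type;
};

struct sketch_cpr_info {
	struct sketch_cp_ring *cp_ring_arr;	/* was: a fixed array of 2 */
	unsigned int cp_ring_count;
};

/* Allocate exactly as many zeroed entries as this MSIX needs.
 * A count of 0 is not meaningful here and may be reported as -ENOMEM.
 */
static inline int sketch_alloc_cp_arr(struct sketch_cpr_info *cpr,
				      unsigned int count)
{
	cpr->cp_ring_arr = calloc(count, sizeof(*cpr->cp_ring_arr));
	if (!cpr->cp_ring_arr)
		return -ENOMEM;
	cpr->cp_ring_count = count;
	return 0;
}

static inline void sketch_free_cp_arr(struct sketch_cpr_info *cpr)
{
	free(cpr->cp_ring_arr);
	cpr->cp_ring_arr = NULL;
	cpr->cp_ring_count = 0;
}
```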
Currently, the opaque field in the TX BD is only used for debugging.
The TX completion logic relies on getting one TX completion for each
packet and they always complete in order.
Improve this scheme by putting the producer information (ring index plus
number of BDs for the packet) in the opaque field. This way, we can
handle TX completion processing by looking at the last TX completion
instead of counting the number of completions.
Since we no longer need to count the exact number of completions, we can
optimize xmit_more by disabling TX completion when the xmit_more
condition is true. This will be done in later patches.
This patch is only initializing the opaque field in the TX BD and is
not changing the driver's TX completion logic yet.
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
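To illustrate the producer information scheme in the commit above: if the
opaque value carries the ring index of a packet's first BD plus its BD
count, the last completion alone gives the new consumer position. The bit
layout and helper names below are assumptions, not the driver's macros.

```c
#include <stdint.h>

/* Assumed layout: bits 23..16 = number of BDs, bits 15..0 = ring index. */
#define SKETCH_OPAQUE_BDS_SHIFT	16
#define SKETCH_OPAQUE_BDS_MASK	0xffu
#define SKETCH_OPAQUE_IDX_MASK	0xffffu

/* Built at transmit time from the producer index and the BD count. */
static inline uint32_t sketch_set_tx_opaque(uint16_t prod_idx, uint8_t nr_bds)
{
	return ((uint32_t)nr_bds << SKETCH_OPAQUE_BDS_SHIFT) | prod_idx;
}

/* From the opaque value of the last completion, compute the consumer
 * index just past that packet. ring_mask is (ring size - 1) for a
 * power-of-two ring. No counting of earlier completions is needed.
 */
static inline uint16_t sketch_opaque_to_cons(uint32_t opaque,
					     uint16_t ring_mask)
{
	uint16_t idx = opaque & SKETCH_OPAQUE_IDX_MASK;
	uint8_t bds = (opaque >> SKETCH_OPAQUE_BDS_SHIFT) &
		      SKETCH_OPAQUE_BDS_MASK;

	return (uint16_t)(idx + bds) & ring_mask;
}
```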
The recent firmware interface change has added 2 counters in struct
rx_port_stats_ext. This caused 2 stray ethtool counters to be
displayed.
Since new counters are added from time to time, fix it so that the
ethtool logic will only display up to the maximum known counters.
These 2 counters are not used by production firmware yet.
Fixes: 754fbf604f ("bnxt_en: Update firmware interface to 1.10.2.171")
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231026013231.53271-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
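The clamping idea in the commit above can be sketched as below: report no
more counters than the driver has names for, even when the firmware
structure grows. The helper name is hypothetical, and treating every
counter as a 64-bit value is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of extended counters to display: whatever the firmware
 * structure holds, capped at the number of counters the driver knows.
 */
static inline size_t sketch_num_ext_stats(size_t fw_stats_bytes,
					  size_t known_counters)
{
	size_t fw_counters = fw_stats_bytes / sizeof(uint64_t);

	return fw_counters < known_counters ? fw_counters : known_counters;
}
```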
PP_FLAG_PAGE_FRAG is not really needed after pp_frag_count
handling is unified and page_pool_alloc_frag() is supported
in 32-bit arch with 64-bit DMA, so remove it.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
CC: Lorenzo Bianconi <lorenzo@kernel.org>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Liang Chen <liangchen.linux@gmail.com>
CC: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20231020095952.11055-3-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current driver code does not accurately report the supported and
advertised link modes. It basically always assumes the media type
is copper for any particular speed. Utilize the recently added link
mode mappings to accurately report fully qualified ethtool link modes for
advertised and supported speeds.
If the media type is known, we will report the supported link modes for
that media only. If the media is not known, we will report all possible
supported link modes. The user can now specify any supported link modes
(including NRZ and PAM4) to advertise for autoneg. It used to only accept
copper NRZ modes.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Barring the BNXT_FW_TO_ETHTOOL speed macros, which will be removed
in the next patch, update code to use the newer API.
No functional change.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor some NRZ/PAM4 link speed related logic into helper functions.
The NRZ and PAM4 link parameters are stored in separate structure fields.
The driver logic has to check whether it is in NRZ or PAM4 mode and
then use the appropriate field.
Refactor this logic into helper functions for better readability.
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A future patch in this series will change the algorithm used to
determine ethtool speed and media modes. Extract the handling of
the unrelated pause and autoneg modes into an independent function.
Also separate FEC handling out of bnxt_fw_to_ethtool_*_spds().
No functional change.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent kernels support changing the number of link lanes via ethtool.
This is useful for determining the appropriate signal mode to use when
a given link speed can be achieved using different lane configurations.
Accept the ethtool lanes parameter when configuring forced speed. If
there is no lanes parameter, select a default.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add infrastructure to look up the enum ethtool_link_mode_bit_indices
from link information provided by the firmware. The link speed,
signal mode, and media type returned by firmware will be used to
look up the ethtool link mode.
The immediate benefit is that once the link mode is determined, we can
now use ethtool_params_from_link_mode() to fill the basic ethtool
parameters including the number of lanes. Lanes will be fully
supported in the next patch.
Signed-off-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
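A small sketch of the lookup idea in the commit above: a table indexed by
speed, signal mode, and media type yields a link mode, from which the lane
count follows. The enums and table are hypothetical and cover only a small
subset; the mode names mirror ethtool link mode names, but this is not the
driver's table.

```c
#include <stddef.h>

enum { SKETCH_SPEED_50G, SKETCH_SPEED_100G, SKETCH_SPEED_COUNT };
enum { SKETCH_SIG_NRZ, SKETCH_SIG_PAM4, SKETCH_SIG_COUNT };
enum { SKETCH_MEDIA_CR, SKETCH_MEDIA_SR, SKETCH_MEDIA_COUNT };

struct sketch_link_mode {
	const char *name;	/* ethtool-style link mode name */
	unsigned int lanes;	/* lanes implied by the mode */
};

static const struct sketch_link_mode
sketch_modes[SKETCH_SPEED_COUNT][SKETCH_SIG_COUNT][SKETCH_MEDIA_COUNT] = {
	[SKETCH_SPEED_50G] = {
		[SKETCH_SIG_NRZ] = {
			[SKETCH_MEDIA_CR] = { "50000baseCR2_Full", 2 },
			[SKETCH_MEDIA_SR] = { "50000baseSR2_Full", 2 },
		},
		[SKETCH_SIG_PAM4] = {
			[SKETCH_MEDIA_CR] = { "50000baseCR_Full", 1 },
			[SKETCH_MEDIA_SR] = { "50000baseSR_Full", 1 },
		},
	},
	[SKETCH_SPEED_100G] = {
		[SKETCH_SIG_NRZ] = {
			[SKETCH_MEDIA_CR] = { "100000baseCR4_Full", 4 },
			[SKETCH_MEDIA_SR] = { "100000baseSR4_Full", 4 },
		},
		[SKETCH_SIG_PAM4] = {
			[SKETCH_MEDIA_CR] = { "100000baseCR2_Full", 2 },
			[SKETCH_MEDIA_SR] = { "100000baseSR2_Full", 2 },
		},
	},
};

/* Return the table entry, or NULL if any index is out of range. */
static inline const struct sketch_link_mode *
sketch_lookup_link_mode(unsigned int speed, unsigned int sig,
			unsigned int media)
{
	if (speed >= SKETCH_SPEED_COUNT || sig >= SKETCH_SIG_COUNT ||
	    media >= SKETCH_MEDIA_COUNT)
		return NULL;
	return &sketch_modes[speed][sig][media];
}
```

The same 100G speed resolves to 4 lanes in NRZ mode and 2 lanes in PAM4
mode, which is why the signal mode has to be part of the lookup key.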
FW sends the async event to the driver when the device temperature goes
above or below the threshold values. Only notify hwmon if the
temperature is increasing to the next alert level, not when it is
decreasing.
Cc: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Defer hwmon_notify_event() to bnxt_sp_task() workqueue because
hwmon_notify_event() can try to acquire a mutex shown in the stack trace
below. Modify bnxt_event_error_report() to return true if we need to
schedule bnxt_sp_task() to notify hwmon.
__schedule+0x68/0x520
hwmon_notify_event+0xe8/0x114
schedule+0x60/0xe0
schedule_preempt_disabled+0x28/0x40
__mutex_lock.constprop.0+0x534/0x550
__mutex_lock_slowpath+0x18/0x20
mutex_lock+0x5c/0x70
kobject_uevent_env+0x2f4/0x3d0
kobject_uevent+0x10/0x20
hwmon_notify_event+0x94/0x114
bnxt_hwmon_notify_event+0x40/0x70 [bnxt_en]
bnxt_event_error_report+0x260/0x290 [bnxt_en]
bnxt_async_event_process.isra.0+0x250/0x850 [bnxt_en]
bnxt_hwrm_handler.isra.0+0xc8/0x120 [bnxt_en]
bnxt_poll_p5+0x150/0x350 [bnxt_en]
__napi_poll+0x3c/0x210
net_rx_action+0x308/0x3b0
__do_softirq+0x120/0x3e0
Cc: Guenter Roeck <linux@roeck-us.net>
Fixes: a19b480145 ("bnxt_en: Event handler for Thermal event")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop unneeded error checking.
devlink_fmsg_*() family of functions is now retaining errors,
so there is no need to check for them after each call.
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A recent FW interface update bumped the size of struct hwrm_func_cfg_input
above 128B which is the max some devices support.
Probe on Stratus (BCM957452) with FW 20.8.3.11 fails with:
bnxt_en ...: Unable to reserve tx rings
bnxt_en ...: 2nd rings reservation failed.
bnxt_en ...: Not enough rings available.
Once probe is fixed other errors pop up:
bnxt_en ...: Failed to set async event completion ring.
This is because __hwrm_send() rejects requests larger than
bp->hwrm_max_ext_req_len with -E2BIG. Since the driver doesn't
actually access any of the new fields, yet, trim the length.
It should be safe.
A similar workaround exists for backing_store_cfg_input,
although that one clamps to a constant of 256 rather than the 128
we'll effectively use here. Michael explains: "the backing
store cfg command is supported by relatively newer firmware
that will accept 256 bytes at least."
To make debugging easier in the future add a warning
for oversized requests.
Fixes: 754fbf604f ("bnxt_en: Update firmware interface to 1.10.2.171")
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20231016171640.1481493-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
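The length trimming described in the commit above can be sketched like
this. Both helpers are hypothetical stand-ins: one mimics the size check
that yields -E2BIG, the other clamps the request length so that check
passes. The 136-byte figure in the note below is an invented example; the
commit only says the structure grew above 128 bytes.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for the send path's size check. */
static inline int sketch_check_req_len(uint32_t req_len,
					uint32_t max_ext_req_len)
{
	return req_len > max_ext_req_len ? -E2BIG : 0;
}

/* Trim the request to what the device accepts. This is safe only while
 * the fields beyond the limit are unused by the caller.
 */
static inline uint32_t sketch_trim_req_len(uint32_t req_len,
					   uint32_t max_ext_req_len)
{
	return req_len < max_ext_req_len ? req_len : max_ext_req_len;
}
```

With a 128-byte limit, an untrimmed 136-byte request is rejected, while
the trimmed length passes the check.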
Newer versions of firmware will pre-reserve 1 VNIC for every possible
PF and VF function. Update the driver logic to take this into account
when assigning VNICs to the VFs. These pre-reserved VNICs for the
inactive VFs should be subtracted from the global pool before
assigning them to the active VFs.
Not doing so may cause discrepancies that ultimately may cause some VFs to
have insufficient VNICs to support features such as aRFS.
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
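A simplified sketch of the accounting in the commit above: VNICs
pre-reserved for inactive VFs are removed from the pool before the rest is
shared among active VFs. The helper name, the even split, and the
one-VNIC-per-inactive-VF model are assumptions for illustration; the
driver's actual arithmetic may differ.

```c
/* Return how many VNICs each active VF can be given after one
 * pre-reserved VNIC per inactive VF is subtracted from the pool.
 */
static inline unsigned int sketch_vnics_per_vf(unsigned int pool_vnics,
						unsigned int total_vfs,
						unsigned int active_vfs)
{
	unsigned int inactive_vfs;

	if (active_vfs == 0 || active_vfs > total_vfs)
		return 0;

	inactive_vfs = total_vfs - active_vfs;
	if (pool_vnics <= inactive_vfs)
		return 0;

	return (pool_vnics - inactive_vfs) / active_vfs;
}
```

For a pool of 64 VNICs with 16 possible VFs and 4 active, 12 VNICs are set
aside and each active VF gets 13, rather than the 16 a naive split of the
whole pool would suggest.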
Add these missing settings in the .ndo_set_vf_vlan() method.
Older firmware does not support the TPID setting so check for
proper support.
Remove the unused BNXT_VF_QOS flag.
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Newer FW will send a new async event when it detects that
the chip's temperature has crossed the configured threshold value.
The driver will now notify hwmon and will log a warning message.
Link: https://lore.kernel.org/netdev/20230815045658.80494-13-michael.chan@broadcom.com/
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: linux-hwmon@vger.kernel.org
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement the sysfs attributes directly in the driver for
shutdown threshold temperature and pass an extra attribute group
to the hwmon core when registering the hwmon device.
Link: https://lore.kernel.org/netdev/20230815045658.80494-12-michael.chan@broadcom.com/
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: linux-hwmon@vger.kernel.org
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of hwmon_device_register_with_groups() is deprecated.
Modify the driver to use hwmon_device_register_with_info().
The driver currently exports only temp1_input through the hwmon sysfs
interface. But FW has been modified to report more threshold
temperatures and the driver wants to report them through the
hwmon interface.
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: linux-hwmon@vger.kernel.org
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is in preparation for upcoming patches in the series.
The driver has to expose more threshold temperatures through the
hwmon sysfs interface. More code will be added, and we do not
want to overload bnxt.c.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: linux-hwmon@vger.kernel.org
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver currently registers and unregisters the hwmon device
in open() and close() respectively. As a result, the user will not
be able to query the hwmon temperature when the interface is in
the ifdown state.
Enhance it by moving the hwmon register/unregister to the
probe/remove functions.
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The main changes are the additional thermal thresholds in
hwrm_temp_monitor_query_output and the new async event to
report thermal errors.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bnxt_poll_nitroa0() invokes bnxt_rx_pkt() which can run an XDP program
which in turn can return XDP_REDIRECT. bnxt_rx_pkt() is also used by
__bnxt_poll_work() which flushes (xdp_do_flush()) the packets after each
round. bnxt_poll_nitroa0() lacks this feature.
xdp_do_flush() should be invoked before leaving the NAPI callback.
Invoke xdp_do_flush() after a redirect in bnxt_poll_nitroa0() NAPI.
Cc: Michael Chan <michael.chan@broadcom.com>
Fixes: f18c2b77b2 ("bnxt_en: optimized XDP_REDIRECT support")
Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Many small changes across the subsystem, some highlights:
- Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma, mthca,
hns, and bnxt_re
- siw now works over tunnel and other netdevs with a MAC address by
removing assumptions about a MAC/GID from the connection manager
- "Doorbell Pacing" for bnxt_re - this is a best effort scheme to allow
userspace to slow down the doorbell rings if the HW gets full
- irdma egress VLAN priority, better QP/WQ sizing
- rxe bug fixes in queue draining and srq resizing
- Support more ethernet speed options in the core layer
- DMABUF support for bnxt_re
- Multi-stage MTT support for erdma to allow much bigger MR registrations
- An irdma fix with a CVE that came in too late to go to -rc, missing
bounds checking for 0 length MRs
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZPEqkAAKCRCFwuHvBreF
YZrNAPoCBfU+VjCKNr2yqF7s52os5ZdBV7Uuh4txHcXWW9H7GAD/f19i2u62fzNu
C27jj4cztemMBb8mgwyxPw/wLg7NLwY=
=pC6k
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Many small changes across the subsystem, some highlights:
- Usual driver cleanups in qedr, siw, erdma, hfi1, mlx4/5, irdma,
mthca, hns, and bnxt_re
- siw now works over tunnel and other netdevs with a MAC address by
removing assumptions about a MAC/GID from the connection manager
- "Doorbell Pacing" for bnxt_re - this is a best effort scheme to
allow userspace to slow down the doorbell rings if the HW gets full
- irdma egress VLAN priority, better QP/WQ sizing
- rxe bug fixes in queue draining and srq resizing
- Support more ethernet speed options in the core layer
- DMABUF support for bnxt_re
- Multi-stage MTT support for erdma to allow much bigger MR
registrations
- An irdma fix with a CVE that came in too late to go to -rc, missing
bounds checking for 0 length MRs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (87 commits)
IB/hfi1: Reduce printing of errors during driver shut down
RDMA/hfi1: Move user SDMA system memory pinning code to its own file
RDMA/hfi1: Use list_for_each_entry() helper
RDMA/mlx5: Fix trailing */ formatting in block comment
RDMA/rxe: Fix redundant break statement in switch-case.
RDMA/efa: Fix wrong resources deallocation order
RDMA/siw: Call llist_reverse_order in siw_run_sq
RDMA/siw: Correct wrong debug message
RDMA/siw: Balance the reference of cep->kref in the error path
Revert "IB/isert: Fix incorrect release of isert connection"
RDMA/bnxt_re: Fix kernel doc errors
RDMA/irdma: Prevent zero-length STAG registration
RDMA/erdma: Implement hierarchical MTT
RDMA/erdma: Refactor the storage structure of MTT entries
RDMA/erdma: Renaming variable names and field names of struct erdma_mem
RDMA/hns: Support hns HW stats
RDMA/hns: Dump whole QP/CQ/MR resource in raw
RDMA/irdma: Add missing kernel-doc in irdma_setup_umode_qp()
RDMA/mlx4: Copy union directly
RDMA/irdma: Drop unused kernel push code
...
All callers of build_skb() (*) in bnxt are in NAPI context.
The budget checking is somewhat convoluted because in the shared
completion queue cases Rx packets are discarded by netpoll
by forcing an error (E). But that happens before skb allocation.
Only a call chain starting at __bnxt_poll_work() can lead to
an skb allocation and it checks budget (b).
* bnxt_rx_multi_page_skb
* bnxt_rx_skb
` bp->rx_skb_func
* bnxt_tpa_end
` bnxt_rx_pkt
E bnxt_force_rx_discard
E bnxt_poll_nitroa0
b __bnxt_poll_work
Use napi_build_skb() to take advantage of the skb cache.
In iperf tests with HW-GRO enabled it barely makes a difference
but in cases where HW-GRO is not as effective (or disabled) it
can give even a >10% boost (20.7Gbps -> 23.1Gbps).
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing driver displays the sum of 4 ring counters under ethtool -S.
These counters are in the array bnxt_sw_func_stats. These counters are
summed at the time of ethtool -S and will be lost when the device is reset.
Replace these counters with the new total ring error counters added in the
last patch. These new counters are saved before reset. ethtool -S will
now display the sum of the saved counters plus the current counters.
Link: https://lore.kernel.org/netdev/CACKFLimD-bKmJ1tGZOLYRjWzEwxkri-Mw7iFme1x2Dr0twdCeg@mail.gmail.com/
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20230817231911.165035-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
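The "saved plus current" reporting in the commit above can be sketched as
follows. The structure, field, and function names are hypothetical; the
point is only that totals are folded into a saved copy before the rings
are freed, and that the report adds the live ring counters on top.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_ring_err_stats {
	uint64_t rx_discards;
	uint64_t rx_resets;
};

/* Add the counters of n rings into *total. */
static inline void
sketch_add_ring_stats(struct sketch_ring_err_stats *total,
		      const struct sketch_ring_err_stats *rings, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		total->rx_discards += rings[i].rx_discards;
		total->rx_resets += rings[i].rx_resets;
	}
}

/* Call before a reset frees the rings, so the counts survive. */
static inline void
sketch_save_before_reset(struct sketch_ring_err_stats *saved,
			 const struct sketch_ring_err_stats *rings, size_t n)
{
	sketch_add_ring_stats(saved, rings, n);
}

/* What a stats dump would show: saved totals plus the live rings. */
static inline struct sketch_ring_err_stats
sketch_report(const struct sketch_ring_err_stats *saved,
	      const struct sketch_ring_err_stats *rings, size_t n)
{
	struct sketch_ring_err_stats out = *saved;

	sketch_add_ring_stats(&out, rings, n);
	return out;
}
```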
Currently, the ring counters are stored in the per-ring data structure.
During reset, all the rings are freed together with the associated
data structures. As a result, all the ring error counters will be reset
to zero.
Add logic to keep track of the total error counts of all the rings
and save them before reset (including ifdown). The next patch will
display these total ring error counters under ethtool -S.
Link: https://lore.kernel.org/netdev/CACKFLimD-bKmJ1tGZOLYRjWzEwxkri-Mw7iFme1x2Dr0twdCeg@mail.gmail.com/
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20230817231911.165035-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If we are doing a complete reset with irq_re_init set to true in
bnxt_close_nic(), all the ring structures will be freed. New
structures will be allocated in bnxt_open_nic(). The current code
increments the rx_resets counter in bnxt_enable_napi() if bnapi->in_reset
is true. In a complete reset, bnapi->in_reset will never be true
since the structure is just allocated.
Increment the rx_resets counter in bnxt_disable_napi() instead. This
will allow us to save all the ring error counters including the
rx_resets counters in the next patch.
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20230817231911.165035-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Convert to use the page pool buffers for the aggregation ring when
running in non-XDP mode. This simplifies the driver and we benefit
from the recycling of pages. Adjust the page pool size to account
for the aggregation ring size.
Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20230817231911.165035-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>