Commit graph

9205 commits

Author SHA1 Message Date
Linus Torvalds
7ce4de1cda RDMA v6.17 merge window pull request
- Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1, rxe,
   erdma, mana_ib
 
 - Prefetch supprot for rxe ODP
 
 - Remove memory window support from hns as new device FW is no longer
   support it
 
 - Remove qib, it is very old and obsolete now, Cornelis wishes to
   restructure the hfi1/qib shared layer
 
 - Fix a race in destroying CQs where we can still end up with work running
   because the work is cancled before the driver stops triggering it
 
 - Improve interaction with namespaces.
    * Follow the devlink namespace for newly spawned RDMA devices
    * Create iopoib net devces in the parent IB device's namespace
    * Allow CAP_NET_RAW checks to pass in user namespaces
 
 - A new flow control scheme for IB MADs to try and avoid queue overflows
   in the network
 
 - Fix 2G message sizes in bnxt_re
 
 - Optimize mkey layout for mlx5 DMABUF
 
 - New "DMA Handle" concept to allow controlling PCI TPH and steering tags
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaIpYoAAKCRCFwuHvBreF
 YUBaAP9Av4O3tX+xV9lpwXqOS6fE34h5KlvULoF+RMtBpkbW6QEAh+e34i3ay3lY
 gQPI3WZV0Vr1lwLv+g8Pyuxt/1JdXQ8=
 =LCBi
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:

 - Various minor code cleanups and fixes for hns, iser, cxgb4, hfi1,
   rxe, erdma, mana_ib

 - Prefetch supprot for rxe ODP

 - Remove memory window support from hns as new device FW is no longer
   support it

 - Remove qib, it is very old and obsolete now, Cornelis wishes to
   restructure the hfi1/qib shared layer

 - Fix a race in destroying CQs where we can still end up with work
   running because the work is cancled before the driver stops
   triggering it

 - Improve interaction with namespaces:
     * Follow the devlink namespace for newly spawned RDMA devices
     * Create iopoib net devces in the parent IB device's namespace
     * Allow CAP_NET_RAW checks to pass in user namespaces

 - A new flow control scheme for IB MADs to try and avoid queue
   overflows in the network

 - Fix 2G message sizes in bnxt_re

 - Optimize mkey layout for mlx5 DMABUF

 - New "DMA Handle" concept to allow controlling PCI TPH and steering
   tags

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
  RDMA/siw: Change maintainer email address
  RDMA/mana_ib: add support of multiple ports
  RDMA/mlx5: Refactor optional counters steering code
  RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr
  IB: Extend UVERBS_METHOD_REG_MR to get DMAH
  RDMA/mlx5: Add DMAH object support
  RDMA/core: Introduce a DMAH object and its alloc/free APIs
  IB/core: Add UVERBS_METHOD_REG_MR on the MR object
  net/mlx5: Add support for device steering tag
  net/mlx5: Expose IFC bits for TPH
  PCI/TPH: Expose pcie_tph_get_st_table_size()
  RDMA/mlx5: Fix incorrect MKEY masking
  RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()
  RDMA/mlx5: remove redundant check on err on return expression
  RDMA/mana_ib: add additional port counters
  RDMA/mana_ib: Fix DSCP value in modify QP
  RDMA/efa: Add CQ with external memory support
  RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpers
  RDMA/uverbs: Add a common way to create CQ with umem
  RDMA/mlx5: Optimize DMABUF mkey page size
  ...
2025-07-31 12:19:55 -07:00
Linus Torvalds
8be4d31cb8 Networking changes for 6.17.
Core & protocols
 ----------------
 
  - Wrap datapath globals into net_aligned_data, to avoid false sharing.
 
  - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container).
 
  - Add SO_INQ and SCM_INQ support to AF_UNIX.
 
  - Add SIOCINQ support to AF_VSOCK.
 
  - Add TCP_MAXSEG sockopt to MPTCP.
 
  - Add IPv6 force_forwarding sysctl to enable forwarding per interface.
 
  - Make TCP validation of whether packet fully fits in the receive
    window and the rcv_buf more strict. With increased use of HW
    aggregation a single "packet" can be multiple 100s of kB.
 
  - Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
    improves latency up to 33% for sockmap users.
 
  - Convert TCP send queue handling from tasklet to BH workque.
 
  - Improve BPF iteration over TCP sockets to see each socket exactly once.
 
  - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code.
 
  - Support enabling kernel threads for NAPI processing on per-NAPI
    instance basis rather than a whole device. Fully stop the kernel NAPI
    thread when threaded NAPI gets disabled. Previously thread would stick
    around until ifdown due to tricky synchronization.
 
  - Allow multicast routing to take effect on locally-generated packets.
 
  - Add output interface argument for End.X in segment routing.
 
  - MCTP: add support for gateway routing, improve bind() handling.
 
  - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink.
 
  - Add a new neighbor flag ("extern_valid"), which cedes refresh
    responsibilities to userspace. This is needed for EVPN multi-homing
    where a neighbor entry for a multi-homed host needs to be synced
    across all the VTEPs among which the host is multi-homed.
 
  - Support NUD_PERMANENT for proxy neighbor entries.
 
  - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM.
 
  - Add sequence numbers to netconsole messages. Unregister netconsole's
    console when all net targets are removed. Code refactoring.
    Add a number of selftests.
 
  - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
    should be used for an inbound SA lookup.
 
  - Support inspecting ref_tracker state via DebugFS.
 
  - Don't force bonding advertisement frames tx to ~333 ms boundaries.
    Add broadcast_neighbor option to send ARP/ND on all bonded links.
 
  - Allow providing upcall pid for the 'execute' command in openvswitch.
 
  - Remove DCCP support from Netfilter's conntrack.
 
  - Disallow multiple packet duplications in the queuing layer.
 
  - Prevent use of deprecated iptables code on PREEMPT_RT.
 
 Driver API
 ----------
 
  - Support RSS and hashing configuration over ethtool Netlink.
 
  - Add dedicated ethtool callbacks for getting and setting hashing fields.
 
  - Add support for power budget evaluation strategy in PSE /
    Power-over-Ethernet. Generate Netlink events for overcurrent etc.
 
  - Support DPLL phase offset monitoring across all device inputs.
    Support providing clock reference and SYNC over separate DPLL
    inputs.
 
  - Support traffic classes in devlink rate API for bandwidth management.
 
  - Remove rtnl_lock dependency from UDP tunnel port configuration.
 
 Device drivers
 --------------
 
  - Add a new Broadcom driver for 800G Ethernet (bnge).
 
  - Add a standalone driver for Microchip ZL3073x DPLL.
 
  - Remove IBM's NETIUCV device driver.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
     - support zero-copy Tx of DMABUF memory
     - take page size into account for page pool recycling rings
    - Intel (100G, ice, idpf):
      - idpf: XDP and AF_XDP support preparations
      - idpf: add flow steering
      - add link_down_events statistic
      - clean up the TSPLL code
      - preparations for live VM migration
    - nVidia/Mellanox:
     - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
     - optimize context memory usage for matchers
     - expose serial numbers in devlink info
     - support PCIe congestion metrics
    - Meta (fbnic):
      - add 25G, 50G, and 100G link modes to phylink
      - support dumping FW logs
    - Marvell/Cavium:
      - support for CN20K generation of the Octeon chips
    - Amazon:
      - add HW clock (without timestamping, just hypervisor time access)
 
  - Ethernet virtual:
    - VirtIO net:
      - support segmentation of UDP-tunnel-encapsulated packets
    - Google (gve):
      - support packet timestamping and clock synchronization
    - Microsoft vNIC:
      - add handler for device-originated servicing events
      - allow dynamic MSI-X vector allocation
      - support Tx bandwidth clamping
 
  - Ethernet NICs consumer, and embedded:
    - AMD:
      - amd-xgbe: hardware timestamping and PTP clock support
    - Broadcom integrated MACs (bcmgenet, bcmasp):
      - use napi_complete_done() return value to support NAPI polling
      - add support for re-starting auto-negotiation
    - Broadcom switches (b53):
      - support BCM5325 switches
      - add bcm63xx EPHY power control
    - Synopsys (stmmac):
      - lots of code refactoring and cleanups
    - TI:
      - icssg-prueth: read firmware-names from device tree
      - icssg: PRP offload support
    - Microchip:
      - lan78xx: convert to PHYLINK for improved PHY and MAC management
      - ksz: add KSZ8463 switch support
    - Intel:
      - support similar queue priority scheme in multi-queue and
        time-sensitive networking (taprio)
      - support packet pre-emption in both
    - RealTek (r8169):
      - enable EEE at 5Gbps on RTL8126
    - Airoha:
      - add PPPoE offload support
      - MDIO bus controller for Airoha AN7583
 
  - Ethernet PHYs:
    - support for the IPQ5018 internal GE PHY
    - micrel KSZ9477 switch-integrated PHYs:
      - add MDI/MDI-X control support
      - add RX error counters
      - add cable test support
      - add Signal Quality Indicator (SQI) reporting
    - dp83tg720: improve reset handling and reduce link recovery time
    - support bcm54811 (and its MII-Lite interface type)
    - air_en8811h: support resume/suspend
    - support PHY counters for QCA807x and QCA808x
    - support WoL for QCA807x
 
  - CAN drivers:
    - rcar_canfd: support for Transceiver Delay Compensation
    - kvaser: report FW versions via devlink dev info
 
  - WiFi:
    - extended regulatory info support (6 GHz)
    - add statistics and beacon monitor for Multi-Link Operation (MLO)
    - support S1G aggregation, improve S1G support
    - add Radio Measurement action fields
    - support per-radio RTS threshold
    - some work around how FIPS affects wifi, which was wrong (RC4 is used
      by TKIP, not only WEP)
    - improvements for unsolicited probe response handling
 
  - WiFi drivers:
    - RealTek (rtw88):
      - IBSS mode for SDIO devices
    - RealTek (rtw89):
      - BT coexistence for MLO/WiFi7
      - concurrent station + P2P support
      - support for USB devices RTL8851BU/RTL8852BU
    - Intel (iwlwifi):
      - use embedded PNVM in (to be released) FW images to fix
        compatibility issues
      - many cleanups (unused FW APIs, PCIe code, WoWLAN)
      - some FIPS interoperability
    - MediaTek (mt76):
      - firmware recovery improvements
      - more MLO work
    - Qualcomm/Atheros (ath12k):
      - fix scan on multi-radio devices
      - more EHT/Wi-Fi 7 features
      - encapsulation/decapsulation offload
    - Broadcom (brcm80211):
      - support SDIO 43751 device
 
  - Bluetooth:
    - hci_event: add support for handling LE BIG Sync Lost event
    - ISO: add socket option to report packet seqnum via CMSG
    - ISO: support SCM_TIMESTAMPING for ISO TS
 
  - Bluetooth drivers:
    - intel_pcie: support Function Level Reset
    - nxpuart: add support for 4M baudrate
    - nxpuart: implement powerup sequence, reset, FW dump, and FW loading
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmiFgLgACgkQMUZtbf5S
 IrvafxAAnQRwYBoIG+piCILx6z5pRvBGHkmEQ4AQgSCFuq2eO3ubwMFIqEybfma1
 5+QFjUZAV3OgGgKRBS2KGWxtSzdiF+/JGV1VOIN67sX3Mm0a2QgjA4n5CgKL0FPr
 o6BEzjX5XwG1zvGcBNQ5BZ19xUUKjoZQgTtnea8sZ57Fsp5RtRgmYRqoewNvNk/n
 uImh0NFsDVb0UeOpSzC34VD9l1dJvLGdui4zJAjno/vpvmT1DkXjoK419J/r52SS
 X+5WgsfJ6DkjHqVN1tIhhK34yWqBOcwGFZJgEnWHMkFIl2FqRfFKMHyqtfLlVnLA
 mnIpSyz8Sq2AHtx0TlgZ3At/Ri8p5+yYJgHOXcDKyABa8y8Zf4wrycmr6cV9JLuL
 z54nLEVnJuvfDVDVJjsLYdJXyhMpZFq6+uAItdxKaw8Ugp/QqG4QtoRj+XIHz4ZW
 z6OohkCiCzTwEISFK+pSTxPS30eOxq43kCspcvuLiwCCStJBRkRb5GdZA4dm7LA+
 1Od4ADAkHjyrFtBqTyyC2scX8UJ33DlAIpAYyIeS6w9Cj9EXxtp1z33IAAAZ03MW
 jJwIaJuc8bK2fWKMmiG7ucIXjPo4t//KiWlpkwwqLhPbjZgfDAcxq1AC2TLoqHBL
 y4EOgKpHDCMAghSyiFIAn2JprGcEt8dp+11B0JRXIn4Pm/eYDH8=
 =lqbe
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Wrap datapath globals into net_aligned_data, to avoid false sharing

   - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container)

   - Add SO_INQ and SCM_INQ support to AF_UNIX

   - Add SIOCINQ support to AF_VSOCK

   - Add TCP_MAXSEG sockopt to MPTCP

   - Add IPv6 force_forwarding sysctl to enable forwarding per interface

   - Make TCP validation of whether packet fully fits in the receive
     window and the rcv_buf more strict. With increased use of HW
     aggregation a single "packet" can be multiple 100s of kB

   - Add MSG_MORE flag to optimize large TCP transmissions via sockmap,
     improves latency up to 33% for sockmap users

   - Convert TCP send queue handling from tasklet to BH workque

   - Improve BPF iteration over TCP sockets to see each socket exactly
     once

   - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code

   - Support enabling kernel threads for NAPI processing on per-NAPI
     instance basis rather than a whole device. Fully stop the kernel
     NAPI thread when threaded NAPI gets disabled. Previously thread
     would stick around until ifdown due to tricky synchronization

   - Allow multicast routing to take effect on locally-generated packets

   - Add output interface argument for End.X in segment routing

   - MCTP: add support for gateway routing, improve bind() handling

   - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink

   - Add a new neighbor flag ("extern_valid"), which cedes refresh
     responsibilities to userspace. This is needed for EVPN multi-homing
     where a neighbor entry for a multi-homed host needs to be synced
     across all the VTEPs among which the host is multi-homed

   - Support NUD_PERMANENT for proxy neighbor entries

   - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM

   - Add sequence numbers to netconsole messages. Unregister
     netconsole's console when all net targets are removed. Code
     refactoring. Add a number of selftests

   - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol
     should be used for an inbound SA lookup

   - Support inspecting ref_tracker state via DebugFS

   - Don't force bonding advertisement frames tx to ~333 ms boundaries.
     Add broadcast_neighbor option to send ARP/ND on all bonded links

   - Allow providing upcall pid for the 'execute' command in openvswitch

   - Remove DCCP support from Netfilter's conntrack

   - Disallow multiple packet duplications in the queuing layer

   - Prevent use of deprecated iptables code on PREEMPT_RT

  Driver API:

   - Support RSS and hashing configuration over ethtool Netlink

   - Add dedicated ethtool callbacks for getting and setting hashing
     fields

   - Add support for power budget evaluation strategy in PSE /
     Power-over-Ethernet. Generate Netlink events for overcurrent etc

   - Support DPLL phase offset monitoring across all device inputs.
     Support providing clock reference and SYNC over separate DPLL
     inputs

   - Support traffic classes in devlink rate API for bandwidth
     management

   - Remove rtnl_lock dependency from UDP tunnel port configuration

  Device drivers:

   - Add a new Broadcom driver for 800G Ethernet (bnge)

   - Add a standalone driver for Microchip ZL3073x DPLL

   - Remove IBM's NETIUCV device driver

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support zero-copy Tx of DMABUF memory
         - take page size into account for page pool recycling rings
      - Intel (100G, ice, idpf):
         - idpf: XDP and AF_XDP support preparations
         - idpf: add flow steering
         - add link_down_events statistic
         - clean up the TSPLL code
         - preparations for live VM migration
      - nVidia/Mellanox:
         - support zero-copy Rx/Tx interfaces (DMABUF and io_uring)
         - optimize context memory usage for matchers
         - expose serial numbers in devlink info
         - support PCIe congestion metrics
      - Meta (fbnic):
         - add 25G, 50G, and 100G link modes to phylink
         - support dumping FW logs
      - Marvell/Cavium:
         - support for CN20K generation of the Octeon chips
      - Amazon:
         - add HW clock (without timestamping, just hypervisor time access)

   - Ethernet virtual:
      - VirtIO net:
         - support segmentation of UDP-tunnel-encapsulated packets
      - Google (gve):
         - support packet timestamping and clock synchronization
      - Microsoft vNIC:
         - add handler for device-originated servicing events
         - allow dynamic MSI-X vector allocation
         - support Tx bandwidth clamping

   - Ethernet NICs consumer, and embedded:
      - AMD:
         - amd-xgbe: hardware timestamping and PTP clock support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - use napi_complete_done() return value to support NAPI polling
         - add support for re-starting auto-negotiation
      - Broadcom switches (b53):
         - support BCM5325 switches
         - add bcm63xx EPHY power control
      - Synopsys (stmmac):
         - lots of code refactoring and cleanups
      - TI:
         - icssg-prueth: read firmware-names from device tree
         - icssg: PRP offload support
      - Microchip:
         - lan78xx: convert to PHYLINK for improved PHY and MAC management
         - ksz: add KSZ8463 switch support
      - Intel:
         - support similar queue priority scheme in multi-queue and
           time-sensitive networking (taprio)
         - support packet pre-emption in both
      - RealTek (r8169):
         - enable EEE at 5Gbps on RTL8126
      - Airoha:
         - add PPPoE offload support
         - MDIO bus controller for Airoha AN7583

   - Ethernet PHYs:
      - support for the IPQ5018 internal GE PHY
      - micrel KSZ9477 switch-integrated PHYs:
         - add MDI/MDI-X control support
         - add RX error counters
         - add cable test support
         - add Signal Quality Indicator (SQI) reporting
      - dp83tg720: improve reset handling and reduce link recovery time
      - support bcm54811 (and its MII-Lite interface type)
      - air_en8811h: support resume/suspend
      - support PHY counters for QCA807x and QCA808x
      - support WoL for QCA807x

   - CAN drivers:
      - rcar_canfd: support for Transceiver Delay Compensation
      - kvaser: report FW versions via devlink dev info

   - WiFi:
      - extended regulatory info support (6 GHz)
      - add statistics and beacon monitor for Multi-Link Operation (MLO)
      - support S1G aggregation, improve S1G support
      - add Radio Measurement action fields
      - support per-radio RTS threshold
      - some work around how FIPS affects wifi, which was wrong (RC4 is
        used by TKIP, not only WEP)
      - improvements for unsolicited probe response handling

   - WiFi drivers:
      - RealTek (rtw88):
         - IBSS mode for SDIO devices
      - RealTek (rtw89):
         - BT coexistence for MLO/WiFi7
         - concurrent station + P2P support
         - support for USB devices RTL8851BU/RTL8852BU
      - Intel (iwlwifi):
         - use embedded PNVM in (to be released) FW images to fix
           compatibility issues
         - many cleanups (unused FW APIs, PCIe code, WoWLAN)
         - some FIPS interoperability
      - MediaTek (mt76):
         - firmware recovery improvements
         - more MLO work
      - Qualcomm/Atheros (ath12k):
         - fix scan on multi-radio devices
         - more EHT/Wi-Fi 7 features
         - encapsulation/decapsulation offload
      - Broadcom (brcm80211):
         - support SDIO 43751 device

   - Bluetooth:
      - hci_event: add support for handling LE BIG Sync Lost event
      - ISO: add socket option to report packet seqnum via CMSG
      - ISO: support SCM_TIMESTAMPING for ISO TS

   - Bluetooth drivers:
      - intel_pcie: support Function Level Reset
      - nxpuart: add support for 4M baudrate
      - nxpuart: implement powerup sequence, reset, FW dump, and FW loading"

* tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits)
  dpll: zl3073x: Fix build failure
  selftests: bpf: fix legacy netfilter options
  ipv6: annotate data-races around rt->fib6_nsiblings
  ipv6: fix possible infinite loop in fib6_info_uses_dev()
  ipv6: prevent infinite loop in rt6_nlmsg_size()
  ipv6: add a retry logic in net6_rt_notify()
  vrf: Drop existing dst reference in vrf_ip6_input_dst
  net/sched: taprio: align entry index attr validation with mqprio
  net: fsl_pq_mdio: use dev_err_probe
  selftests: rtnetlink.sh: remove esp4_offload after test
  vsock: remove unnecessary null check in vsock_getname()
  igb: xsk: solve negative overflow of nb_pkts in zerocopy mode
  stmmac: xsk: fix negative overflow of budget in zerocopy mode
  dt-bindings: ieee802154: Convert at86rf230.txt yaml format
  net: dsa: microchip: Disable PTP function of KSZ8463
  net: dsa: microchip: Setup fiber ports for KSZ8463
  net: dsa: microchip: Write switch MAC address differently for KSZ8463
  net: dsa: microchip: Use different registers for KSZ8463
  net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver
  dt-bindings: net: dsa: microchip: Add KSZ8463 switch support
  ...
2025-07-30 08:58:55 -07:00
Linus Torvalds
22c5696e3f Driver core changes for 6.17-rc1
- DEBUGFS
 
   - Remove unneeded debugfs_file_{get,put}() instances
 
   - Remove last remnants of debugfs_real_fops()
 
   - Allow storing non-const void * in struct debugfs_inode_info::aux
 
 - SYSFS
 
   - Switch back to attribute_group::bin_attrs (treewide)
 
   - Switch back to bin_attribute::read()/write() (treewide)
 
   - Constify internal references to 'struct bin_attribute'
 
 - Support cache-ids for device-tree systems
 
   - Add arch hook arch_compact_of_hwid()
 
   - Use arch_compact_of_hwid() to compact MPIDR values on arm64
 
 - Rust
 
   - Device
 
     - Introduce CoreInternal device context (for bus internal methods)
 
     - Provide generic drvdata accessors for bus devices
 
     - Provide Driver::unbind() callbacks
 
     - Use the infrastructure above for auxiliary, PCI and platform
 
     - Implement Device::as_bound()
 
     - Rename Device::as_ref() to Device::from_raw() (treewide)
 
     - Implement fwnode and device property abstractions
 
       - Implement example usage in the Rust platform sample driver
 
   - Devres
 
     - Remove the inner reference count (Arc) and use pin-init instead
 
     - Replace Devres::new_foreign_owned() with devres::register()
 
     - Require T to be Send in Devres<T>
 
     - Initialize the data kept inside a Devres last
 
     - Provide an accessor for the Devres associated Device
 
   - Device ID
 
     - Add support for ACPI device IDs and driver match tables
 
     - Split up generic device ID infrastructure
 
     - Use generic device ID infrastructure in net::phy
 
   - DMA
 
     - Implement the dma::Device trait
 
     - Add DMA mask accessors to dma::Device
 
     - Implement dma::Device for PCI and platform devices
 
     - Use DMA masks from the DMA sample module
 
   - I/O
 
     - Implement abstraction for resource regions (struct resource)
 
     - Implement resource-based ioremap() abstractions
 
     - Provide platform device accessors for I/O (remap) requests
 
   - Misc
 
     - Support fallible PinInit types in Revocable
 
     - Implement Wrapper<T> for Opaque<T>
 
     - Merge pin-init blanket dependencies (for Devres)
 
 - Misc
 
   - Fix OF node leak in auxiliary_device_create()
 
   - Use util macros in device property iterators
 
   - Improve kobject sample code
 
   - Add device_link_test() for testing device link flags
 
   - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
 
   - Hint to prefer container_of_const() over container_of()
 -----BEGIN PGP SIGNATURE-----
 
 iHQEABYKAB0WIQS2q/xV6QjXAdC7k+1FlHeO1qrKLgUCaIjkhwAKCRBFlHeO1qrK
 LpXuAP9RWwfD9ZGgQZ9OsMk/0pZ2mDclaK97jcmI9TAeSxeZMgD1FHnOMTY7oSIi
 iG7Muq0yLD+A5gk9HUnMUnFNrngWCg==
 =jgRj
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core

Pull driver core updates from Danilo Krummrich:
 "debugfs:
   - Remove unneeded debugfs_file_{get,put}() instances
   - Remove last remnants of debugfs_real_fops()
   - Allow storing non-const void * in struct debugfs_inode_info::aux

  sysfs:
   - Switch back to attribute_group::bin_attrs (treewide)
   - Switch back to bin_attribute::read()/write() (treewide)
   - Constify internal references to 'struct bin_attribute'

  Support cache-ids for device-tree systems:
   - Add arch hook arch_compact_of_hwid()
   - Use arch_compact_of_hwid() to compact MPIDR values on arm64

  Rust:
   - Device:
       - Introduce CoreInternal device context (for bus internal methods)
       - Provide generic drvdata accessors for bus devices
       - Provide Driver::unbind() callbacks
       - Use the infrastructure above for auxiliary, PCI and platform
       - Implement Device::as_bound()
       - Rename Device::as_ref() to Device::from_raw() (treewide)
       - Implement fwnode and device property abstractions
       - Implement example usage in the Rust platform sample driver
   - Devres:
       - Remove the inner reference count (Arc) and use pin-init instead
       - Replace Devres::new_foreign_owned() with devres::register()
       - Require T to be Send in Devres<T>
       - Initialize the data kept inside a Devres last
       - Provide an accessor for the Devres associated Device
   - Device ID:
       - Add support for ACPI device IDs and driver match tables
       - Split up generic device ID infrastructure
       - Use generic device ID infrastructure in net::phy
   - DMA:
       - Implement the dma::Device trait
       - Add DMA mask accessors to dma::Device
       - Implement dma::Device for PCI and platform devices
       - Use DMA masks from the DMA sample module
   - I/O:
       - Implement abstraction for resource regions (struct resource)
       - Implement resource-based ioremap() abstractions
       - Provide platform device accessors for I/O (remap) requests
   - Misc:
       - Support fallible PinInit types in Revocable
       - Implement Wrapper<T> for Opaque<T>
       - Merge pin-init blanket dependencies (for Devres)

  Misc:
   - Fix OF node leak in auxiliary_device_create()
   - Use util macros in device property iterators
   - Improve kobject sample code
   - Add device_link_test() for testing device link flags
   - Fix typo in Documentation/ABI/testing/sysfs-kernel-address_bits
   - Hint to prefer container_of_const() over container_of()"

* tag 'driver-core-6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: (84 commits)
  rust: io: fix broken intra-doc links to `platform::Device`
  rust: io: fix broken intra-doc link to missing `flags` module
  rust: io: mem: enable IoRequest doc-tests
  rust: platform: add resource accessors
  rust: io: mem: add a generic iomem abstraction
  rust: io: add resource abstraction
  rust: samples: dma: set DMA mask
  rust: platform: implement the `dma::Device` trait
  rust: pci: implement the `dma::Device` trait
  rust: dma: add DMA addressing capabilities
  rust: dma: implement `dma::Device` trait
  rust: net::phy Change module_phy_driver macro to use module_device_table macro
  rust: net::phy represent DeviceId as transparent wrapper over mdio_device_id
  rust: device_id: split out index support into a separate trait
  device: rust: rename Device::as_ref() to Device::from_raw()
  arm64: cacheinfo: Provide helper to compress MPIDR value into u32
  cacheinfo: Add arch hook to compress CPU h/w id into 32 bits for cache-id
  cacheinfo: Set cache 'id' based on DT data
  container_of: Document container_of() is not to be used in new code
  driver core: auxiliary bus: fix OF node leak
  ...
2025-07-29 12:15:39 -07:00
Konstantin Taranov
60c9a34df2 RDMA/mana_ib: add support of multiple ports
If the HW indicates support of multiple ports for rdma, create an IB device
with a port per netdev in the ethernet mana driver.

CM is only available on port 1, but RC QPs are supported on all
ports.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1753174515-23634-1-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23 04:31:11 -04:00
Patrisious Haddad
10d4de4189 RDMA/mlx5: Refactor optional counters steering code
Currently there isn't a clear layer separation between the counters and
the steering code, whereas the steering code is doing redundant access
to the counter struct.

Separate the fs.c and counters.c, where fs code won't access or be
aware of counter structs but only the objects it needs.

As a result, move mlx5_rdma_counter struct from the header file to be
an internal struct for the counters file only.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/9854d1fdb140e4a6552b7a2fd1a028cfe488a935.1753004208.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23 03:38:57 -04:00
Yishai Hadas
e1bed9a94d RDMA/mlx5: Add DMAH support for reg_user_mr/reg_user_dmabuf_mr
As part of this enhancement, allow the creation of an MKEY associated
with a DMA handle.

Additional notes:

MKEYs with TPH (i.e. TLP Processing Hints) attributes are currently not
UMR-capable unless explicitly enabled by firmware or hardware.
Therefore, to maintain such MKEYs in the MR cache, the TPH fields have
been added to the rb_key structure, with a dedicated hash bucket.

The ability to bypass the kernel verbs flow and create an MKEY with TPH
attributes using DEVX has been restricted. TPH must follow the standard
InfiniBand flow, where a DMAH is created with the appropriate security
checks and management mechanisms in place.

DMA handles are currently not supported in conjunction with On-Demand
Paging (ODP).

Re-registration of memory regions originally created with TPH attributes
is currently not supported.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/1c485651cf8417694ddebb80446c5093d5a791a9.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23 01:42:11 -04:00
Yishai Hadas
a272019a46 IB: Extend UVERBS_METHOD_REG_MR to get DMAH
Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers.

It will be used in mlx5 driver as part of the next patch from the
series.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23 01:42:11 -04:00
Yishai Hadas
3c81907075 RDMA/mlx5: Add DMAH object support
This patch introduces support for allocating and deallocating the DMAH
object.

Further details:
----------------
The DMAH API is exposed to upper layers only if the underlying device
supports TPH.

It uses the mlx5_core steering tag (ST) APIs to get a steering tag index
based on the provided input.

The obtained index is stored in the device-specific mlx5_dmah structure
for future use.

Upcoming patches in the series will integrate the allocated DMAH into
the memory region (MR) registration process.

Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/778550776799d82edb4d05da249a1cff00160b50.1752752567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23 01:42:10 -04:00
Leon Romanovsky
b834407368 RDMA/mlx5: Fix incorrect MKEY masking
mkey_mask is __be64 type, while MLX5_MKEY_MASK_PAGE_SIZE is declared as
unsigned long long. This causes to the static checkers errors:

drivers/infiniband/hw/mlx5/umr.c:663:49: warning: invalid assignment: &=
drivers/infiniband/hw/mlx5/umr.c:663:49:    left side has type restricted __be64
drivers/infiniband/hw/mlx5/umr.c:663:49:    right side has type int

Fixes: e73242aa14 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Link: https://patch.msgid.link/e354d70b98dfa5ecf4c236a36cd36b64add9d9de.1753003467.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-21 02:27:52 -04:00
Leon Romanovsky
d59ebb4549 RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()
As Colin reported:
 "The variable zapped_blocks is a size_t type and is being assigned a int
  return value from the call to _mlx5r_umr_zap_mkey. Since zapped_blocks is an
  unsigned type, the error check for zapped_blocks < 0 will never be true."

So separate return error and nblocks assignment.

Fixes: e73242aa14 ("RDMA/mlx5: Optimize DMABUF mkey page size")
Reported-by: Colin King (gmail) <colin.i.king@gmail.com>
Closes: https://lore.kernel.org/all/79166fb1-3b73-4d37-af02-a17b22eb8e64@gmail.com
Link: https://patch.msgid.link/71d8ea208ac7eaa4438af683b9afaed78625e419.1753003467.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-21 02:27:40 -04:00
Colin Ian King
aee80e6ffc RDMA/mlx5: remove redundant check on err on return expression
Currently all paths that set err and then check it for an error
perform immediate returns, hence err always zero at the end of
the function _mlx5r_umr_zap_mkey.  The return expression
err ? err : nblocks has a redundant check on the err since err
is always zero, so just return nblocks instead.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://patch.msgid.link/20250717112108.4036171-1-colin.i.king@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-17 08:19:19 -04:00
Andy Gospodarek
c34632dbb2 bnxt: move bnxt_hsi.h to include/linux/bnxt/hsi.h
This moves bnxt_hsi.h contents to a common location so it can be
properly referenced by bnxt_en, bnxt_re, and bnge.

Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20250714170202.39688-1-gospo@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-15 16:03:24 -07:00
Jakub Kicinski
2f4053db0b Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:

====================
mlx5-next updates 2025-07-14

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: IFC updates for disabled host PF
  net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc
  RDMA/mlx5: Fix UMR modifying of mkey page size
  net/mlx5: Expose HCA capability bits for mkey max page size
====================

Link: https://patch.msgid.link/1752481357-34780-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14 17:26:57 -07:00
Zhiyue Qiu
084f35b84f RDMA/mana_ib: add additional port counters
Add packet and request port counters to mana_ib.

Signed-off-by: Zhiyue Qiu <zhiyueqiu@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1752143395-5324-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 04:24:11 -04:00
Shiraz Saleem
62de0e6732 RDMA/mana_ib: Fix DSCP value in modify QP
Convert the traffic_class in GRH to a DSCP value as required by the HW.

Fixes: e095405b45 ("RDMA/mana_ib: Modify QP state")
Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com>
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://patch.msgid.link/1752143085-4169-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 04:23:47 -04:00
Michael Margolin
9fb3dd8519 RDMA/efa: Add CQ with external memory support
Add an option to create CQ using external memory instead of allocating
in the driver. The memory can be passed from userspace by dmabuf fd and
an offset or a VA. One of the possible usages is creating CQs that
reside in accelerator memory, allowing low latency asynchronous direct
polling from the accelerator device. Add a capability bit to reflect on
the feature support.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250708202308.24783-4-mrgolin@amazon.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 04:00:34 -04:00
Edward Srouji
e73242aa14 RDMA/mlx5: Optimize DMABUF mkey page size
The current implementation of DMABUF memory registration uses a fixed
page size for the memory key (mkey), which can lead to suboptimal
performance when the underlying memory layout may offer better page
size.

The optimization improves performance by reducing the number of page
table entries required for the mkey, leading to less MTT/KSM descriptors
that the HCA must go through to find translations, fewer cache-lines,
and shorter UMR work requests on mkey updates such as when
re-registering or reusing a cacheable mkey.

To ensure safe page size updates, the implementation uses a 5-step
process:
1. Make the first X entries non-present, while X is calculated to be
   minimal according to a large page shift that can be used to cover the
   MR length.
2. Update the page size to the large supported page size
3. Load the remaining N-X entries according to the (optimized)
   page shift
4. Update the page size according to the (optimized) page shift
5. Load the first X entries with the correct translations

This ensures that at no point is the MR accessible with a partially
updated translation table, maintaining correctness and preventing
access to stale or inconsistent mappings, such as having an mkey
advertising the new page size while some of the underlying page table
entries still contain the old page size translations.

Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/bc05a6b2142c02f96a90635f9a4458ee4bbbf39f.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 03:14:19 -04:00
Michael Guralnik
fcfb03597b RDMA/mlx5: Align mkc page size capability check to PRM
Align the capabilities checked when using the log_page_size 6th bit in the
mkey context to the PRM definition. The upper and lower bounds are set by
max/min caps, and modification of the 6th bit by UMR is allowed only when
a specific UMR cap is set.
Current implementation falsely assumes all page sizes up-to 2^63 are
supported when the UMR cap is set. In case the upper bound cap is lower
than 63, this might result a FW syndrome on mkey creation, e.g:
mlx5_core 0000:c1:00.0: mlx5_cmd_out_err:832:(pid 0): CREATE_MKEY(0×200) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0×38a711), err(-22)

Previous cap enforcement is still correct for all current HW, FW and
driver combinations. However, this patch aligns the code to be PRM
compliant in the general case.

Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/eab4eeb4785105a4bb5eb362dc0b3662cd840412.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 03:14:14 -04:00
Leon Romanovsky
9879bddf5a Optimize DMABUF mkey page size in mlx5
From Edward:

This patch series enables the mlx5 driver to dynamically choose the
optimal page size for a DMABUF-based memory key (mkey), rather than
always registering with a fixed page size.

Previously, DMABUF memory registration used a fixed 4K page size for
mkeys which could lead to suboptimal performance when the underlying
memory layout may offer better page sizes.

This approach did not take advantage of larger page size capabilities
advertised by the HCA, and the driver was not setting the proper page
size mask in the mkey mask when performing page size changes,
potentially
leading to invalid registrations when updating to a very large pages.

This series improves DMABUF performance by dynamically selecting the
best page size for a given memory region (MR) both at creation time and
on page fault occurrences, based on the underlying layout and fixing
related gaps and bugs.

By doing so, we reduce the number of page table entries (and thus MTT/
KSM descriptors) that the HCA must traverse, which in turn reduces
cache-line fetches.

Thanks

* mlx5-next:
  RDMA/mlx5: Fix UMR modifying of mkey page size
  net/mlx5: Expose HCA capability bits for mkey max page size

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 03:00:21 -04:00
Edward Srouji
c4f96972c3 RDMA/mlx5: Fix UMR modifying of mkey page size
When changing the page size on an mkey, the driver needs to set the
appropriate bits in the mkey mask to indicate which fields are being
modified.
The 6th bit of a page size in mlx5 driver is considered an extension,
and this bit has a dedicated capability and mask bits.

Previously, the driver was not setting this mask in the mkey mask when
performing page size changes, regardless of its hardware support,
potentially leading to an incorrect page size updates.

This fixes the issue by setting the relevant bit in the mkey mask when
performing page size changes on an mkey and the 6th bit of this field is
supported by the hardware.

Fixes: cef7dde883 ("net/mlx5: Expand mkey page size to support 6 bits")
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13 02:57:30 -04:00
Al Viro
2b4b80cfcf hfi1: get rid of redundant debugfs_file_{get,put}()
All files in question are created via debugfs_create_file(), so
exclusion with removals is provided by debugfs wrappers; as the matter
of fact, hfi1-private wrappers had been redundant at least since 2017...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702211508.GB3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09 13:30:28 +02:00
Basel Nassar
475ac071ba RDMA/efa: Add Network HW statistics counters
Update device API and request network counters. Expose newly added
counters through ib core counters mechanism.

Reviewed-by: David Shoolman <shoolman@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Basel Nassar <baselna@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://patch.msgid.link/20250706070740.22534-1-mrgolin@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09 02:55:53 -04:00
Jakub Kicinski
80b0dd1c4e Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Tariq Toukan says:

====================
mlx5-next updates 2025-07-08

The following pull-request contains common mlx5 updates
for your *net-next* tree.

v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com

* 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
  net/mlx5: Check device memory pointer before usage
  net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
  net/mlx5: Add IFC bits for PCIe Congestion Event object
  net/mlx5: Small refactor for general object capabilities
  net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain
====================

Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08 16:59:57 -07:00
Leon Romanovsky
f3b7a65ce5 Merge branch 'mlx5-next' into wip/leon-for-next
* mlx5-next:
  net/mlx5: Check device memory pointer before usage
  net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
  net/mlx5: Add IFC bits for PCIe Congestion Event object
  net/mlx5: Small refactor for general object capabilities
2025-07-07 06:29:16 -04:00
Kalesh AP
7788278ff2 RDMA/bnxt_re: Use macro instead of hard coded value
1. Defined a macro for the hard coded value.
2. "access" field in the request structure is of type "u8".
   Updated the mask accordingly.

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Link: https://patch.msgid.link/20250704043857.19158-4-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
Selvin Xavier
0aed817380 RDMA/bnxt_re: Support 2G message size
bnxt_qplib_put_sges is calculating the length in
a signed int. So handling the 2G message size
is not working since it is considered as negative.

Use a unsigned number to calculate the total message
length. As per the spec, IB message size shall be
between zero and 2^31 bytes (inclusive). So adding
a check for 2G message size.

Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Link: https://patch.msgid.link/20250704043857.19158-3-kalesh-anakkur.purayil@broadcom.com
Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
Kalesh AP
09d231ab56 RDMA/bnxt_re: Fix size of uverbs_copy_to() in BNXT_RE_METHOD_GET_TOGGLE_MEM
Since both "length" and "offset" are of type u32, there is
no functional issue here.

Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Signed-off-by: Shravya KN <shravya.k-n@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20250704043857.19158-2-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
Junxian Huang
79d56805c5 RDMA/hns: Fix -Wframe-larger-than issue
Fix -Wframe-larger-than issue by allocating memory for qpc struct
with kzalloc() instead of using stack memory.

Fixes: 606bf89e98 ("RDMA/hns: Refactor for hns_roce_v2_modify_qp function")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506240032.CSgIyFct-lkp@intel.com/
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
Junxian Huang
5338abb299 RDMA/hns: Drop GFP_NOWARN
GFP_NOWARN silences all warnings on dma_alloc_coherent() failure,
which might otherwise help with troubleshooting.

Fixes: 9a4435375c ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
Junxian Huang
278c18a4a7 RDMA/hns: Fix accessing uninitialized resources
hr_dev->pgdir_list and hr_dev->pgdir_mutex won't be initialized if
CQ/QP record db are not enabled, but they are also needed when using
SRQ with SRQ record db enabled. Simplified the logic by always
initailizing the reosurces.

Fixes: c9813b0b99 ("RDMA/hns: Support SRQ record doorbell")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
Junxian Huang
2c2ec0106c RDMA/hns: Get message length of ack_req from FW
ACK_REQ_FREQ indicates the number of packets (after MTU fragmentation)
HW sends before setting an ACK request. When MTU is greater than or
equal to 1024, the current ACK_REQ_FREQ value causes HW to request an
ACK for every MTU fragment. The processing of a large number of ACKs
severely impacts HW performance when sending large size payloads.

Get message length of ack_req from FW so that we can adjust this
parameter according to different situations. There are several
constraints for ACK_REQ_FREQ:

1. mtu * (2 ^ ACK_REQ_FREQ) should not be too large, otherwise it may
   cause some unexpected retries when sending large payload.

2. ACK_REQ_FREQ should be larger than or equal to LP_PKTN_INI.

3. ACK_REQ_FREQ must be equal to LP_PKTN_INI when using LDCP
   or HC3 congestion control algorithm.

Fixes: 56518a603f ("RDMA/hns: Modify the value of long message loopback slice")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07 01:37:35 -04:00
wenglianfa
998b41cb20 RDMA/hns: Fix HW configurations not cleared in error flow
hns_roce_clear_extdb_list_info() will eventually do some HW
configurations through FW, and they need to be cleared by
calling hns_roce_function_clear() when the initialization
fails.

Fixes: 7e78dd816e ("RDMA/hns: Clear extended doorbell info before using")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-06 02:46:02 -04:00
wenglianfa
c6957b95ec RDMA/hns: Fix double destruction of rsv_qp
rsv_qp may be double destroyed in error flow, first in free_mr_init(),
and then in hns_roce_exit(). Fix it by moving the free_mr_init() call
into hns_roce_v2_init().

list_del corruption, ffff589732eb9b50->next is LIST_POISON1 (dead000000000100)
WARNING: CPU: 8 PID: 1047115 at lib/list_debug.c:53 __list_del_entry_valid+0x148/0x240
...
Call trace:
 __list_del_entry_valid+0x148/0x240
 hns_roce_qp_remove+0x4c/0x3f0 [hns_roce_hw_v2]
 hns_roce_v2_destroy_qp_common+0x1dc/0x5f4 [hns_roce_hw_v2]
 hns_roce_v2_destroy_qp+0x22c/0x46c [hns_roce_hw_v2]
 free_mr_exit+0x6c/0x120 [hns_roce_hw_v2]
 hns_roce_v2_exit+0x170/0x200 [hns_roce_hw_v2]
 hns_roce_exit+0x118/0x350 [hns_roce_hw_v2]
 __hns_roce_hw_v2_init_instance+0x1c8/0x304 [hns_roce_hw_v2]
 hns_roce_hw_v2_reset_notify_init+0x170/0x21c [hns_roce_hw_v2]
 hns_roce_hw_v2_reset_notify+0x6c/0x190 [hns_roce_hw_v2]
 hclge_notify_roce_client+0x6c/0x160 [hclge]
 hclge_reset_rebuild+0x150/0x5c0 [hclge]
 hclge_reset+0x10c/0x140 [hclge]
 hclge_reset_subtask+0x80/0x104 [hclge]
 hclge_reset_service_task+0x168/0x3ac [hclge]
 hclge_service_task+0x50/0x100 [hclge]
 process_one_work+0x250/0x9a0
 worker_thread+0x324/0x990
 kthread+0x190/0x210
 ret_from_fork+0x10/0x18

Fixes: fd8489294d ("RDMA/hns: Fix Use-After-Free of rsv_qp on HIP08")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20250703113905.3597124-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-06 02:46:02 -04:00
Stav Aviram
70f238c902 net/mlx5: Check device memory pointer before usage
Add a NULL check before accessing device memory to prevent a crash if
dev->dm allocation in mlx5_init_once() fails.

Fixes: c9b9dcb430 ("net/mlx5: Move device memory management to mlx5_core")
Signed-off-by: Stav Aviram <saviram@nvidia.com>
Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02 14:08:23 -04:00
Thomas Fourier
1db50f7b7a Fix dma_unmap_sg() nents value
The dma_unmap_sg() functions should be called with the same nents as the
dma_map_sg(), not the value the map function returned.

Fixes: ed10435d35 ("RDMA/erdma: Implement hierarchical MTT")
Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com>
Link: https://patch.msgid.link/20250630092346.81017-2-fourier.thomas@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02 05:11:44 -04:00
Parav Pandit
bd82467f17 RDMA/mlx5: Check CAP_NET_RAW in user namespace for devx create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the devx object.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: a8b92ca1b0 ("IB/mlx5: Introduce DEVX")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/36ee87e92defd81410c6a2b33f9d6c0d6dcfd64c.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02 05:11:44 -04:00
Parav Pandit
14957e8125 RDMA/mlx5: Check CAP_NET_RAW in user namespace for anchor create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the anchor.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 0c6ab0ca9a ("RDMA/mlx5: Expose steering anchor to userspace")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/c2376ca75e7658e2cbd1f619cf28fbe98c906419.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01 05:21:39 -04:00
Parav Pandit
95a89ec304 RDMA/mlx5: Check CAP_NET_RAW in user namespace for flow create
Currently, the capability check is done in the default
init_user_ns user namespace. When a process runs in a
non default user namespace, such check fails. Due to this
when a process is running using Podman, it fails to create
the flow.

Since the RDMA device is a resource within a network namespace,
use the network namespace associated with the RDMA device to
determine its owning user namespace.

Fixes: 3226944124 ("IB/mlx5: Introduce driver create and destroy flow methods")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Link: https://patch.msgid.link/a4dcd5e3ac6904ef50b19e56942ca6ab0728794c.1750963874.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01 05:21:33 -04:00
Mark Bloch
611d08207d RDMA/mlx5: Allocate IB device with net namespace supplied from core dev
Use the new ib_alloc_device_with_net() API to allocate the IB device
so that it is properly bound to the network namespace obtained via
mlx5_core_net(). This change ensures correct namespace association
(e.g., for containerized setups).

Additionally, expose mlx5_core_net so that RDMA driver can use it.

Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26 08:10:07 -04:00
Yury Norov [NVIDIA]
b61cc1891c RDMI: hfi1: drop cpumask_empty() call in hfi1/affinity.c
In few places, the driver tests a cpumask for emptiness immediately
before calling functions that report emptiness themself.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-8-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:52 -04:00
Yury Norov [NVIDIA]
3ad8fb8afd RDMA: hfi1: simplify hfi1_get_proc_affinity()
The function protects the for loop with affinity->num_core_siblings > 0
condition, which is redundant because the loop will break immediately in
that case.

Drop it and save one indentation level.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-7-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:48 -04:00
Yury Norov [NVIDIA]
4ea9f618d7 RDMA: hfi1: use rounddown in find_hw_thread_mask()
num_cores_per_socket is calculated by dividing by
node_affinity.num_online_nodes, but all users of this variable multiply
it by node_affinity.num_online_nodes again. This effectively is the same
as rounding it down by node_affinity.num_online_nodes.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-6-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:44 -04:00
Yury Norov [NVIDIA]
59ae2e3c6a RDMA: hfi1: simplify init_real_cpu_mask()
The function opencodes cpumask_nth() and cpumask_clear_cpus(). The
dedicated helpers are easier to use and  usually much faster than
opencoded for-loops.

While there, drop useless clear of real_cpu_mask.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-5-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:39 -04:00
Yury Norov [NVIDIA]
15b0536313 RDMA: hfi1: simplify find_hw_thread_mask()
The function opencodes cpumask_nth() and cpumask_clear_cpus(). The
dedicated helpers are easier to use and usually much faster than
opencoded for-loops.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-4-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:35 -04:00
Yury Norov [NVIDIA]
59f7d21385 RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()
The function divides number of online CPUs by num_core_siblings, and
later checks the divider by zero. This implies a possibility to get
and divide-by-zero runtime error. Fix it by moving the check prior to
division. This also helps to save one indentation level.

Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Link: https://patch.msgid.link/20250604193947.11834-3-yury.norov@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26 05:19:30 -04:00
Patrisious Haddad
40852c8901 RDMA/mlx5: Add multiple priorities support to RDMA TRANSPORT userspace tables
Support the creation of RDMA TRANSPORT tables over multiple priorities
via matcher creation.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/bb38e50ae4504e979c6568d41939402a4cf15635.1750148083.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25 04:00:33 -04:00
Mark Zhang
b5eeb8365d RDMA/mlx5: Support driver APIs pre_destroy_cq and post_destroy_cq
- pre_destroy_cq: Destroy FW CQ object so that no new CQ event would
                  be generated;
- post_destroy_cq: Release all resources.

This patch, along with last one, fixes the crash below.

 Unable to handle kernel paging request at virtual address ffff8000114b1180
 Mem abort info:
   ESR = 0x96000047
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000047
   CM = 0, WnR = 1
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f4582000
 [ffff8000114b1180] pgd=00000447fffff003, p4d=00000447fffff003, pud=00000447ffffe003, pmd=00000447ffffb003, pte=0000000000000000
 Internal error: Oops: 96000047 [#1] SMP
 Modules linked in: udp_diag uio_pci_generic uio tcp_diag inet_diag binfmt_misc sn_core_odd(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) kpatch_9658536(OK) kpatch_9322385(OK) kpatch_8843421(OK) kpatch_8636216(OK) vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sm4_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt sg acpi_ipmi ipmi_si ipmi_msghandler m1_uncore_ddrss_pmu m1_uncore_cmn_pmu team_yosemite9rc6(OE) vnic(OE) ip_tables mlx5_ib(OE) sd_mod ast mlx5_core(OE) i2c_algo_bit drm_vram_helper psample drm_kms_helper mlxdevm(OE) auxiliary(OE) mlxfw(OE) syscopyarea sysfillrect tls sysimgblt fb_sys_fops drm_ttm_helper nvme ttm nvme_core drm t10_pi i2c_designware_platform i2c_designware_core i2c_core ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ipmi_devintf]
 CPU: 83 PID: 59375 Comm: kworker/u253:1 Kdump: loaded Tainted: G           OE K   5.10.84-004.ali5000.alios7.aarch64 #1
 Hardware name: Inspur AliServer-Xuanwu2.0AM-02-2UM1P-5B/AS1221MG1, BIOS 1.2.M1.AL.P.158.00 08/31/2023
 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
 pstate: 82c00089 (Nzcv daIf +PAN +UAO +TCO BTYPE=--)
 pc : native_queued_spin_lock_slowpath+0x1c4/0x31c
 lr : mlx5_ib_poll_cq+0x18c/0x2f8 [mlx5_ib]
 sp : ffff80002be1bc80
 x29: ffff80002be1bc80 x28: ffff000810e69000
 x27: ffff000810e69000 x26: ffff000810e69200
 x25: 0000000000000000 x24: ffff8000117db000
 x23: ffff04000156b780 x22: 0000000000000000
 x21: ffff04000ce6c160 x20: ffff0008196a4000
 x19: 0000000000000010 x18: 0000000000000020
 x17: 0000000000000000 x16: 0000000000000000
 x15: ffff040055a364e8 x14: ffffffffffffffff
 x13: ffff80002318bda8 x12: ffff0400358836e8
 x11: 0000000000000040 x10: 0000000000000eb0
 x9 : 0000000000000000 x8 : 0000000000000000
 x7 : ffff04477fa20140 x6 : ffff8000114b1140
 x5 : ffff04477fa20140 x4 : ffff8000114b1180
 x3 : ffff000810e69200 x2 : ffff8000114b1180
 x1 : 0000000001500000 x0 : ffff04477fa20148
 Call trace:
  native_queued_spin_lock_slowpath+0x1c4/0x31c
  __ib_process_cq+0x74/0x1b8 [ib_core]
  ib_cq_poll_work+0x34/0xa0 [ib_core]
  process_one_work+0x1d8/0x4b0
  worker_thread+0x180/0x440
  kthread+0x114/0x120
 Code: 910020e0 8b0400c4 f862d929 aa0403e2 (f8296847)
 ---[ end trace 387be2290557729c ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: disabled
 CPU features: 0x9850817,7a60aa38
 Memory Limit: none
 Starting crashdump kernel...
 Bye!

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://patch.msgid.link/aaf0072f350d1c7e8731f43b79e11a560bafb9e0.1750070205.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25 03:49:56 -04:00
Patrisious Haddad
a9a9e68954 RDMA/mlx5: Fix vport loopback for MPV device
Always enable vport loopback for both MPV devices on driver start.

Previously in some cases related to MPV RoCE, packets weren't correctly
executing loopback check at vport in FW, since it was disabled.
Due to complexity of identifying such cases for MPV always enable vport
loopback for both GVMIs when binding the slave to the master port.

Fixes: 0042f9e458 ("RDMA/mlx5: Enable vport loopback when user context or QP mandate")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/d4298f5ebb2197459e9e7221c51ecd6a34699847.1750064969.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25 03:41:59 -04:00
Patrisious Haddad
acd245b1e3 RDMA/mlx5: Fix CC counters query for MPV
In case, CC counters are querying for the second port use the correct
core device for the query instead of always using the master core device.

Fixes: aac4492ef2 ("IB/mlx5: Update counter implementation for dual port RoCE")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/9cace74dcf106116118bebfa9146d40d4166c6b0.1750064969.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25 03:41:55 -04:00
Patrisious Haddad
3cc1dbfddf RDMA/mlx5: Fix HW counters query for non-representor devices
To get the device HW counters, a non-representor switchdev device
should use the mlx5_ib_query_q_counters() function and query all of
the available counters. While a representor device in switchdev mode
should use the mlx5_ib_query_q_counters_vport() function and query only
the Q_Counters without the PPCNT counters and congestion control counters,
since they aren't relevant for a representor device.

Currently a non-representor switchdev device skips querying the PPCNT
counters and congestion control counters, leaving them unupdated.
Fix that by properly querying those counters for non-representor devices.

Fixes: d22467a71e ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/56bf8af4ca8c58e3fb9f7e47b1dca2009eeeed81.1750064969.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25 03:41:50 -04:00