Commit graph

1241 commits

Author SHA1 Message Date
Ben Skeggs
0c6aa94f99 drm/nouveau/gsp: switch to a simpler GSP-RM header layout
Rather than using OpenRM's directory structure for headers, move to a
layout that's split roughly around RM API boundaries.

Also move the headers from include/nvrm to subdev/gsp/rm/r535/nvrm,
with the rest of the r535-specific code.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
c472d82834 drm/nouveau/gsp: move subdev/engine impls to subdev/gsp/rm/r535/
Move all the remaining GSP-RM code together underneath a versioned path,
to make the code easier to work with when adding support for a newer RM
version.

Aside from adjusting include paths, no code change is intended.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
594766ca3e drm/nouveau/gsp: move booter handling to GPU-specific code
GH100/GBxxx have significant changes to the GSP-RM boot process.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
7f022236b5 drm/nouveau/gsp: move firmware loading to GPU-specific code
GH100/GBxxx use a slightly different set of firmwares to boot GSP-RM.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
f964336483 drm/nouveau/gsp: split device handling out on its own
Split handling of NV01_DEVICE (and other related objects) out into its
own module.

Aside from moving the function pointers, no code change is intended.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
45a78c6405 drm/nouveau/gsp: split client handling out on its own
Split NV01_ROOT handling out into its own module.

Aside from moving the function pointers, no code change is intended.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
be33f49980 drm/nouveau/gsp: split rm alloc handling out on its own
Split base RM_ALLOC handling out into its own module.

Aside from moving the function pointers, no code change is intended.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
063d193f12 drm/nouveau/gsp: split rm ctrl handling out on its own
Split base RM_CONTROL handling out into its own module.

Aside from moving the function pointers, no code change is intended.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
8a8b1ec526 drm/nouveau/gsp: split rpc handling out on its own
Later patches in the series add HALs around various RM APIs in order to
support a newer version of GSP-RM firmware.  In order to do this, begin
by splitting the code up into "modules" that roughly represent RM's API
boundaries so they can be more easily managed.

Aside from moving the RPC function pointers, no code change is indended.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:23 +10:00
Ben Skeggs
b8a90901db drm/nouveau/gsp: remove gsp-specific chid allocation path
In order to specify a channel ID to RM during channel allocation, the
channel ID is broken down into a "userd page" index and an index into
that page.

It was assumed that RM would enforce that the same physical block of
memory be used for all CHIDs within a "userd page", and the GSP paths
override NVKM's normal CHID allocation to handle this.

However, none of that turns out to be necessary.

Remove the GSP-specific code and use the regular CHID allocation path.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:22 +10:00
Ben Skeggs
7904bcdcf6 drm/nouveau/gsp: fix rm shutdown wait condition
Though the initial upstreamed GSP-RM version in nouveau was 535.113.01,
the code was developed against earlier versions.

535.42.02 modified the mailbox value used by GSP-RM to signal shutdown
has completed, which was missed at the time.

I'm not aware of any issues caused by this, but noticed the bug while
working on GB20x support.

Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Reviewed-by: Dave Airlie <airlied@redhat.com>
Reviewed-by: Timur Tabi <ttabi@nvidia.com>
Tested-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-05-19 06:29:22 +10:00
Zhi Wang
a738fa9105 drm/nouveau/nvkm: introduce new GSP reply policy NVKM_GSP_RPC_REPLY_POLL
Some GSP RPC commands need a new reply policy: "caller don't care about
the message content but want to make sure a reply is received". To
support this case, a new reply policy is introduced.

NV_VGPU_MSG_FUNCTION_ALLOC_MEMORY is a large GSP RPC command. The actual
required policy is NVKM_GSP_RPC_REPLY_POLL. This can be observed from the
dump of the GSP message queue. After the large GSP RPC command is issued,
GSP will write only an empty RPC header in the queue as the reply.

Without this change, the policy "receiving the entire message" is used
for NV_VGPU_MSG_FUNCTION_ALLOC_MEMORY. This causes the timeout of receiving
the returned GSP message in the suspend/resume path.

Introduce the new reply policy NVKM_GSP_RPC_REPLY_POLL, which waits for
the returned GSP message but discards it for the caller. Use the new policy
NVKM_GSP_RPC_REPLY_POLL on the GSP RPC command
NV_VGPU_MSG_FUNCTION_ALLOC_MEMORY.

Fixes: 50f290053d ("drm/nouveau: support handling the return of large GSP message")
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250227013554.8269-3-zhiw@nvidia.com
2025-03-09 13:42:10 +01:00
Zhi Wang
4570355f8e drm/nouveau/nvkm: factor out current GSP RPC command policies
There can be multiple cases of handling the GSP RPC messages, which are
the reply of GSP RPC commands according to the requirement of the
callers and the nature of the GSP RPC commands.

The current supported reply policies are "callers don't care" and "receive
the entire message" according to the requirement of the callers. To
introduce a new policy, factor out the current RPC command reply polices.
Also, centralize the handling of the reply in a single function.

Factor out NVKM_GSP_RPC_REPLY_NOWAIT as "callers don't care" and
NVKM_GSP_RPC_REPLY_RECV as "receive the entire message". Introduce a
kernel doc to document the policies. Factor out
r535_gsp_rpc_handle_reply().

No functional change is intended for small GSP RPC commands. For large GSP
commands, the caller decides the policy of how to handle the returned GSP
RPC message.

Cc: Ben Skeggs <bskeggs@nvidia.com>
Cc: Alexandre Courbot <acourbot@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250227013554.8269-2-zhiw@nvidia.com
2025-03-09 13:41:59 +01:00
Dave Airlie
fb51bf0255 Linux 6.14-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAme7hfkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGU+IH/1bk6zIvAwXXS5yu
 KNsQ8dEkC3Xme6HqLtPsAhRLF+5YJf6MaGm1ip5dDMyIvasa2gwvCQQQoOpeMbKj
 79VKT+m9t3szMHZaQYjlOuYHBmNSJ4cMCD2Qh6ktXHGPfTTWDFGf7fBwBOkVNeJU
 1Ask+bxeop21aJMhfYXrUta3OYyerLBUR6jCiCM82A/GLtdv6oNGXBu3ygDt9Tjx
 ZHSl+CYjKpmGUP8JnMKwCBHVguEfqgzZ//dY1H16AvOLed9k2jkMFn8O5Vi3vjnx
 TWMMXoiJimuamGzbjxtCCqzxNlFFDT4gRpDqeJxb16W/gDTFmbRr9LDjNehCZe33
 AigLZ6M=
 =Y/7F
 -----END PGP SIGNATURE-----

Merge tag 'v6.14-rc4' into drm-next

Backmerge Linux 6.14-rc4 at the request of tzimmermann so misc-next
can base on rc4.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-02-25 17:36:09 +10:00
Dan Carpenter
fdee05235a drm/nouveau: Fix error pointer dereference in r535_gsp_msgq_recv()
If "rpc" is an error pointer then return directly.  Otherwise it leads
to an error pointer dereference.

Fixes: 50f290053d ("drm/nouveau: support handling the return of large GSP message")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/b7052ac0-98e4-433b-ad58-f563bf51858c@stanley.mountain
2025-02-19 14:49:03 +01:00
Aaron Kling
3dbc0215e3 drm/nouveau/pmu: Fix gp10b firmware guard
Most kernel configs enable multiple Tegra SoC generations, causing this
typo to go unnoticed. But in the case where a kernel config is strictly
for Tegra186, this is a problem.

Fixes: 989863d7cb ("drm/nouveau/pmu: select implementation based on available firmware")
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250218-nouveau-gm10b-guard-v2-1-a4de71500d48@gmail.com
2025-02-19 13:31:59 +01:00
Zhi Wang
24079ed2aa drm/nouveau: consume the return of large GSP message
As the GSP message recv path is able to handle the return of large GSP
message, consume the return of large GSP message in the sending path.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-16-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
50f290053d drm/nouveau: support handling the return of large GSP message
The max GSP message element size is 16 pages (including the headers). To
send a message larger than 16 pages, nvkm should split it into multiple
and send them accordingly. The first element has the expected function
number, while the rest are sent with function number as
NV_VGPU_MSG_FUNCTION_CONTINUATION_RECORD. GSP consumes the elements from
the cmdq and always writes the result back to the msgq. The result is also
formed as split elements.

However, nvkm is able to split the large GSP message and send them, but
totally not aware of handling the return of the large GSP message, which
are the split elements in the msgq. Thus, it keeps dumping the unknown RPC
messages from msgq, which is actually CONTINUATION_RECORD message,
discard them unexpectedly. Thus, the caller will not be able to consume
the result from GSP.

Introduce the handling of the return of large GSP message on the msgq path.
Slightly re-factor the low-level part of msg receiving routines. Merge the
split elements back into a large element before handling it to the upper
level. Thus, the upper-level of GSP RPC APIs don't need to be heavily
changed.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-15-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
3c48ecb38a drm/nouveau: factor out r535_gsp_msgq_recv_one_elem()
Prepare for supporting receive the large GSP RPC message.

Factor out r535_gsp_msgq_recv_one_elem(). Fold its params into a data
structure of params. Move the allocation of the GSP RPC message to its
caller. Refine the variable names in the re-factor.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-14-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
c965e3598b drm/nouveau: factor out r535_gsp_msgq_peek()
To receive a GSP message queue element from the GSP status queue, the
driver needs to make sure there are available elements in the queue.

The previous r535_gsp_msgq_wait() consists of three functions, which is
a little too complicated for a single function:
- wait for an available element.
- peek the message element header in the queue.
- recevice the element from the queue.

Factor out r535_gsp_msgq_peek() and divide the functions in
r535_gsp_msgq_wait() into three functions.

No functional change is intended.

Cc: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-13-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
1829ee0b05 drm/nouveau: rename the variable "cmd" to "msg" in r535_gsp_cmdq_{get, push}()
Refine the name to align with the terms in the kernel doc.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-12-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
4624450452 drm/nouveau: refine the variable names in r535_gsp_msg_recv()
The variable "msg" in r535_gsp_msg_recv() actually means the GSP RPC.

Refine the names to align with the terms in the kernel doc.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-11-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
0268040b9c drm/nouveau: refine the variable names in r535_gsp_rpc_push()
The variable names in r535_gsp_rpc_push() are quite confusing and some
of them are not representing what they really are.

Update the names and explanations in the decoder section of the
kernel doc. Refine the names to align with the terms in the kernel doc.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-10-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
1bb9bb50a4 drm/nouveau: remove the magic number in r535_gsp_rpc_push()
There has been a GSP_MSG_MAX_SIZE which represents the max size of a GSP
message element header. Use it instead of a magic number.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-9-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
bbae6680cf drm/nouveau: fix the broken marco GSP_MSG_MAX_SIZE
The macro GSP_MSG_MAX_SIZE refers to another macro that doesn't exist.
It represents the max GSP message element size.

Fix the broken marco so it can be used to replace some magic numbers in
the code.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-8-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
bda6fe811f drm/nouveau: rename "argc" to what it represents in GSP RPC routines
The name "argc" has different meanings in different functions.

To improve the readability, it's better to refine it to a name that
reflects what it represents.

Rename "argc" to what it represents. Add terms in the decoder section to
explain their meaning.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
[ Fix indentation. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-7-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
0c2f211b66 drm/nouveau: rename "argv" to what it represents in *rm_{alloc, ctrl}_*()
The name "argv" has different meanings in different functions.

To improve the readability, it's better to refine it to a name that
reflects what it represents.

Rename "argv" to what it represents. Wrap the long container_of() into
to_payload_header() to denote a clear meaning and make checkpatch.pl
happy.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-6-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
a15b537976 drm/nouveau: remove unused param repc in *rm_alloc_push()
The user of *rm_alloc_push() always pass 0 in repc.

Remove unused param repc since no user actually uses it.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-5-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
2c6a79af3f drm/nouveau: rename "argv" to what it represents on the GSP message send path
The name "argv" has different meanings in different functions.

To improve the readability, it's better to refine it to a name that
reflects what it represents.

Rename "repc" to what it represents in the GSP message send path.
Wrap the long container_of() into to_gsp_hdr() to make checkpatch.pl
happy.

No functional change is intended.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-4-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
f98ed88eb9 drm/nouveau: rename "repc" to "gsp_rpc_len" on the GSP message recv path
The name "repc" has different meanings in different contexts.

To improve the readability, it's better to refine it to a name that
reflects what it actually represents.

Rename "repc" to "gsp_rpc_len" in the GSP message recv path. Add an
section in the doc to explain the terms.

No functional change is intended.

Cc: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-3-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Zhi Wang
22807d30fa drm/nouveau: add a kernel doc to introduce the GSP RPC
In order to explain the name clean-ups in GSP RPC routines, a kernel
doc to explain the memory layout and terms is required.

Add a kernel doc to introduce the GSP RPC.

Cc: Danilo Krummrich <dakr@kernel.org>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
[ Fix bullet list indentation; add SPDX-License-Identifier. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250124182958.2040494-2-zhiw@nvidia.com
2025-01-25 00:55:10 +01:00
Timur Tabi
97395ce76e drm/nouveau: fix kernel-doc comments
Fix some malformed kernel-doc comments that were added in a recent commit.

Also, kernel-doc does not support global variables, so change those
kernel-doc comments into regular comments.

Fixes: 214c9539cf ("drm/nouveau: expose GSP-RM logging buffers via debugfs")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412310834.jtCJj4oz-lkp@intel.com/
Signed-off-by: Timur Tabi <ttabi@nvidia.com>
Reviewed-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20250108234329.842256-1-ttabi@nvidia.com
2025-01-09 20:06:33 +01:00
Timur Tabi
214c9539cf drm/nouveau: expose GSP-RM logging buffers via debugfs
The LOGINIT, LOGINTR, LOGRM, and LOGPMU buffers are circular buffers
that have printf-like logs from GSP-RM and PMU encoded in them.

LOGINIT, LOGINTR, and LOGRM are allocated by Nouveau and their DMA
addresses are passed to GSP-RM during initialization. The buffers are
required for GSP-RM to initialize properly.

LOGPMU is also allocated by Nouveau, but its contents are updated
when Nouveau receives an NV_VGPU_MSG_EVENT_UCODE_LIBOS_PRINT RPC from
GSP-RM. Nouveau then copies the RPC to the buffer.

The messages are encoded as an array of variable-length structures that
contain the parameters to an NV_PRINTF call. The format string and
parameter count are stored in a special ELF image that contains only
logging strings. This image is not currently shipped with the Nvidia
driver.

There are two methods to extract the logs.

OpenRM tries to load the logging ELF, and if present, parses the log
buffers in real time and outputs the strings to the kernel console.

Alternatively, and this is the method used by this patch, the buffers
can be exposed to user space, and a user-space tool (along with the
logging ELF image) can parse the buffer and dump the logs.

This method has the advantage that it allows the buffers to be parsed
even when the logging ELF file is not available to the user. However,
it has the disadvantage the debugfs entries need to remain until the
driver is unloaded.

The buffers are exposed via debugfs. If GSP-RM fails to initialize, then
Nouveau immediately shuts down the GSP interface. This would normally
also deallocate the logging buffers, thereby preventing the user from
capturing the debug logs.

To avoid this, introduce the keep-gsp-logging command line parameter. If
specified, and if at least one logging buffer has content, then Nouveau
will migrate these buffers into new debugfs entries that are retained
until the driver unloads.

An end-user can capture the logs using the following commands:

    cp /sys/kernel/debug/nouveau/<path>/loginit loginit
    cp /sys/kernel/debug/nouveau/<path>/logrm logrm
    cp /sys/kernel/debug/nouveau/<path>/logintr logintr
    cp /sys/kernel/debug/nouveau/<path>/logpmu logpmu

where (for a PCI device) <path> is the PCI ID of the GPU (e.g.
0000:65:00.0).

Since LOGPMU is not needed for normal GSP-RM operation, it is only
created if debugfs is available. Otherwise, the
NV_VGPU_MSG_EVENT_UCODE_LIBOS_PRINT RPCs are ignored.

A simple way to test the buffer migration feature is to have
nvkm_gsp_init() return an error code.

Tested-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241030202952.694055-2-ttabi@nvidia.com
2024-12-04 21:47:53 +01:00
Timur Tabi
7c995e2fd9 drm/nouveau: retain device pointer in nvkm_gsp_mem object
Store the struct device pointer used to allocate the DMA buffer in
the nvkm_gsp_mem object.  This allows nvkm_gsp_mem_dtor() to release
the buffer without needing the nvkm_gsp.  This is needed so that
we can retain DMA buffers even after the nvkm_gsp object is deleted.

Signed-off-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241030202952.694055-1-ttabi@nvidia.com
2024-12-04 21:47:47 +01:00
Maxime Ripard
3aba2eba84
Merge drm/drm-next into drm-misc-next
Kickstart 6.14 cycle.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2024-12-02 12:44:18 +01:00
Zhi Wang
01ed662bdd nvkm: correctly calculate the available space of the GSP cmdq buffer
r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
buffer when handling a large RPC request. When it sees at least one
available page in the cmdq, it quits the waiting with the amount of
free buffer pages in the queue.

Unfortunately, it always takes the [write pointer, buf_size) as
available buffer pages before rolling back and wrongly calculates the
size of the data should be copied. Thus, it can overwrite the RPC
request that GSP is currently reading, which causes GSP hang due
to corrupted RPC request:

[  549.209389] ------------[ cut here ]------------
[  549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[  549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
[  549.225752]  iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
[  549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G            E      6.9.0-rc6+ #1
[  549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
[  549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[  549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c
[  549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246
[  549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28
[  549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730
[  549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70
[  549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020
[  549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000
[  549.406988] FS:  00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000
[  549.415076] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0
[  549.427963] PKRU: 55555554
[  549.430672] Call Trace:
[  549.433126]  <TASK>
[  549.435233]  ? __warn+0x7f/0x130
[  549.438473]  ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[  549.443426]  ? report_bug+0x18a/0x1a0
[  549.447098]  ? handle_bug+0x3c/0x70
[  549.450589]  ? exc_invalid_op+0x14/0x70
[  549.454430]  ? asm_exc_invalid_op+0x16/0x20
[  549.458619]  ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
[  549.463565]  r535_gsp_msg_recv+0x46/0x230 [nvkm]
[  549.468257]  r535_gsp_rpc_push+0x106/0x160 [nvkm]
[  549.473033]  r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm]
[  549.478422]  nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm]
[  549.483899]  nvidia_grid_init+0xb1/0xd0 [nvkm]
[  549.488420]  ? srso_alias_return_thunk+0x5/0xfbef5
[  549.493213]  nvkm_device_pci_probe+0x305/0x420 [nvkm]
[  549.498338]  local_pci_probe+0x46/0xa0
[  549.502096]  pci_call_probe+0x56/0x170
[  549.505851]  pci_device_probe+0x79/0xf0
[  549.509690]  ? driver_sysfs_add+0x59/0xc0
[  549.513702]  really_probe+0xd9/0x380
[  549.517282]  __driver_probe_device+0x78/0x150
[  549.521640]  driver_probe_device+0x1e/0x90
[  549.525746]  __driver_attach+0xd2/0x1c0
[  549.529594]  ? __pfx___driver_attach+0x10/0x10
[  549.534045]  bus_for_each_dev+0x78/0xd0
[  549.537893]  bus_add_driver+0x112/0x210
[  549.541750]  driver_register+0x5c/0x120
[  549.545596]  ? __pfx_nvkm_init+0x10/0x10 [nvkm]
[  549.550224]  do_one_initcall+0x44/0x300
[  549.554063]  ? do_init_module+0x23/0x240
[  549.557989]  do_init_module+0x64/0x240

Calculate the available buffer page before rolling back based on
the result from the waiting.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017071922.2518724-3-zhiw@nvidia.com
2024-11-25 17:36:07 +01:00
Zhi Wang
8d9beb4aeb nvkm/gsp: correctly advance the read pointer of GSP message queue
A GSP event message consists three parts: message header, RPC header,
message body. GSP calculates the number of pages to write from the
total size of a GSP message. This behavior can be observed from the
movement of the write pointer.

However, nvkm takes only the size of RPC header and message body as
the message size when advancing the read pointer. When handling a
two-page GSP message in the non rollback case, It wrongly takes the
message body of the previous message as the message header of the next
message. As the "message length" tends to be zero, in the calculation of
size needs to be copied (0 - size of (message header)), the size needs to
be copied will be "0xffffffxx". It also triggers a kernel panic due to a
NULL pointer error.

[  547.614102] msg: 00000f90: ff ff ff ff ff ff ff ff 40 d7 18 fb 8b 00 00 00  ........@.......
[  547.622533] msg: 00000fa0: 00 00 00 00 ff ff ff ff ff ff ff ff 00 00 00 00  ................
[  547.630965] msg: 00000fb0: ff ff ff ff ff ff ff ff 00 00 00 00 ff ff ff ff  ................
[  547.639397] msg: 00000fc0: ff ff ff ff 00 00 00 00 ff ff ff ff ff ff ff ff  ................
[  547.647832] nvkm 0000:c1:00.0: gsp: peek msg rpc fn:0 len:0x0/0xffffffffffffffe0
[  547.655225] nvkm 0000:c1:00.0: gsp: get msg rpc fn:0 len:0x0/0xffffffffffffffe0
[  547.662532] BUG: kernel NULL pointer dereference, address: 0000000000000020
[  547.669485] #PF: supervisor read access in kernel mode
[  547.674624] #PF: error_code(0x0000) - not-present page
[  547.679755] PGD 0 P4D 0
[  547.682294] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  547.686643] CPU: 22 PID: 322 Comm: kworker/22:1 Tainted: G            E      6.9.0-rc6+ #1
[  547.694893] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
[  547.702626] Workqueue: events r535_gsp_msgq_work [nvkm]
[  547.707921] RIP: 0010:r535_gsp_msg_recv+0x87/0x230 [nvkm]
[  547.713375] Code: 00 8b 70 08 48 89 e1 31 d2 4c 89 f7 e8 12 f5 ff ff 48 89 c5 48 85 c0 0f 84 cf 00 00 00 48 81 fd 00 f0 ff ff 0f 87 c4 00 00 00 <8b> 55 10 41 8b 46 30 85 d2 0f 85 f6 00 00 00 83 f8 04 76 10 ba 05
[  547.732119] RSP: 0018:ffffabe440f87e10 EFLAGS: 00010203
[  547.737335] RAX: 0000000000000010 RBX: 0000000000000008 RCX: 000000000000003f
[  547.744461] RDX: 0000000000000000 RSI: ffffabe4480a8030 RDI: 0000000000000010
[  547.751585] RBP: 0000000000000010 R08: 0000000000000000 R09: ffffabe440f87bb0
[  547.758707] R10: ffffabe440f87dc8 R11: 0000000000000010 R12: 0000000000000000
[  547.765834] R13: 0000000000000000 R14: ffff9351df1e5000 R15: 0000000000000000
[  547.772958] FS:  0000000000000000(0000) GS:ffff93708eb00000(0000) knlGS:0000000000000000
[  547.781035] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  547.786771] CR2: 0000000000000020 CR3: 00000003cc220002 CR4: 0000000000770ef0
[  547.793896] PKRU: 55555554
[  547.796600] Call Trace:
[  547.799046]  <TASK>
[  547.801152]  ? __die+0x20/0x70
[  547.804211]  ? page_fault_oops+0x75/0x170
[  547.808221]  ? print_hex_dump+0x100/0x160
[  547.812226]  ? exc_page_fault+0x64/0x150
[  547.816152]  ? asm_exc_page_fault+0x22/0x30
[  547.820341]  ? r535_gsp_msg_recv+0x87/0x230 [nvkm]
[  547.825184]  r535_gsp_msgq_work+0x42/0x50 [nvkm]
[  547.829845]  process_one_work+0x196/0x3d0
[  547.833861]  worker_thread+0x2fc/0x410
[  547.837613]  ? __pfx_worker_thread+0x10/0x10
[  547.841885]  kthread+0xdf/0x110
[  547.845031]  ? __pfx_kthread+0x10/0x10
[  547.848775]  ret_from_fork+0x30/0x50
[  547.852354]  ? __pfx_kthread+0x10/0x10
[  547.856097]  ret_from_fork_asm+0x1a/0x30
[  547.860019]  </TASK>
[  547.862208] Modules linked in: nvkm(E) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) mlx5_ib(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) mxm_wmi(E) joydev(E) rapl(E) ptdma(E) i2c_piix4(E) acpi_cpufreq(E) wmi_bmof(E) pcspkr(E) k10temp(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) drm_shmem_helper(E) crct10dif_pclmul(E) drm_kms_helper(E) ahci(E) crc32_pclmul(E) nvme_tcp(E) libahci(E) nvme(E) crc32c_intel(E) nvme_fabrics(E) cdc_ether(E) nvme_core(E) usbnet(E) mlx5_core(E) ghash_clmulni_intel(E) drm(E) libata(E) ccp(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
[  547.862283]  iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
[  547.962691] CR2: 0000000000000020
[  547.966003] ---[ end trace 0000000000000000 ]---
[  549.012012] clocksource: Long readout interval, skipping watchdog check: cs_nsec: 1370499158 wd_nsec: 1370498904
[  549.043676] pstore: backend (erst) writing error (-28)
[  549.050924] RIP: 0010:r535_gsp_msg_recv+0x87/0x230 [nvkm]
[  549.056389] Code: 00 8b 70 08 48 89 e1 31 d2 4c 89 f7 e8 12 f5 ff ff 48 89 c5 48 85 c0 0f 84 cf 00 00 00 48 81 fd 00 f0 ff ff 0f 87 c4 00 00 00 <8b> 55 10 41 8b 46 30 85 d2 0f 85 f6 00 00 00 83 f8 04 76 10 ba 05
[  549.075138] RSP: 0018:ffffabe440f87e10 EFLAGS: 00010203
[  549.080361] RAX: 0000000000000010 RBX: 0000000000000008 RCX: 000000000000003f
[  549.087484] RDX: 0000000000000000 RSI: ffffabe4480a8030 RDI: 0000000000000010
[  549.094609] RBP: 0000000000000010 R08: 0000000000000000 R09: ffffabe440f87bb0
[  549.101733] R10: ffffabe440f87dc8 R11: 0000000000000010 R12: 0000000000000000
[  549.108857] R13: 0000000000000000 R14: ffff9351df1e5000 R15: 0000000000000000
[  549.115982] FS:  0000000000000000(0000) GS:ffff93708eb00000(0000) knlGS:0000000000000000
[  549.124061] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  549.129807] CR2: 0000000000000020 CR3: 00000003cc220002 CR4: 0000000000770ef0
[  549.136940] PKRU: 55555554
[  549.139653] Kernel panic - not syncing: Fatal exception
[  549.145054] Kernel Offset: 0x18c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  549.165074] ---[ end Kernel panic - not syncing: Fatal exception ]---

Also, nvkm wrongly advances the read pointer when handling a two-page GSP
message in the rollback case. In the rollback case, the GSP message will
be copied in two rounds. When handling a two-page GSP message, nvkm first
copies amount of (GSP_PAGE_SIZE - header) data into the buffer, then
advances the read pointer by the result of DIV_ROUND_UP(size,
GSP_PAGE_SIZE). Thus, the read pointer is advanced by 1.

Next, nvkm copies the amount of (total size - (GSP_PAGE_SIZE -
header)) data into the buffer. The left amount of the data will be always
larger than one page since the message header is not taken into account
in the first copy. Thus, the read pointer is advanced by DIV_ROUND_UP(
size(larger than one page), GSP_PAGE_SIZE) = 2.

In the end, the read pointer is wrongly advanced by 3 when handling a
two-page GSP message in the rollback case.

Fix the problems by taking the total size of the message into account
when advancing the read pointer and calculate the read pointer in the end
of the all copies for the rollback case.

BTW: the two-page GSP message can be observed in the msgq when vGPU is
enabled.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241017071922.2518724-2-zhiw@nvidia.com
2024-11-25 17:36:06 +01:00
Linus Torvalds
28eb75e178 drm for 6.13-rc1
core:
 - split DSC helpers from DP helpers
 - clang build fixes for drm/mm test
 - drop simple pipeline support for gem vram
 - document submission error signaling
 - move drm_rect to drm core module from kms helper
 - add default client setup to most drivers
 - move to video aperture helpers instead of drm ones
 
 tests:
 - new framebuffer tests
 
 ttm:
 - remove swapped and pinned BOs from TTM lru
 
 panic:
 - fix uninit spinlock
 - add ABGR2101010 support
 
 bridge:
 - add TI TDP158 support
 - use standard PM OPS
 
 dma-fence:
 - use read_trylock instead of read_lock to help lockdep
 
 scheduler:
 - add errno to sched start to report different errors
 - add locking to drm_sched_entity_modify_sched
 - improve documentation
 
 xe:
 - add drm_line_printer
 - lots of refactoring
 - Enable Xe2 + PES disaggregation
 - add new ARL PCI ID
 - SRIOV development work
 - fix exec unnecessary implicit fence
 - define and parse OA sync props
 - forcewake refactoring
 
 i915:
 - Enable BMG/LNL ultra joiner
 - Enable 10bpx + CCS scanout on ICL+, fp16/CCS on TGL+
 - use DSB for plane/color mgmt
 - Arrow lake PCI IDs
 - lots of i915/xe display refactoring
 - enable PXP GuC autoteardown
 - Pantherlake (PTL) Xe3 LPD display enablement
 - Allow fastset HDR infoframe changes
 - write DP source OUI for non-eDP sinks
 - share PCI IDs between i915 and xe
 
 amdgpu:
 - SDMA queue reset support
 - SMU 13.0.6, JPEG 4.0.3 updates
 - Initial runtime repartitioning support
 - rework IP structs for multiple IP instances
 - Fetch EDID from _DDC if available
 - SMU13 zero rpm user control
 - lots of fixes/cleanups
 
 amdkfd:
 - Increase event FIFO size
 - add topology cap flag for per queue reset
 
 msm:
 - DPU:
 - SA8775P support
 - (disabled by default) MSM8917, MSM8937, MSM8953 and MSM8996 support
 - Enable large framebuffer support
 - Drop MSM8998 and SDM845
 - DP:
 - SA8775P support
 - GPU:
 - a7xx preemption support
 - Adreno A663 support
 
 ast:
 - warn about unsupported TX chips
 
 ivpu:
 - add coredump
 - add pantherlake support
 
 rockchip:
 - 4K@60Hz display enablement
 - generate pll programming tables
 
 panthor:
 - add timestamp query API
 - add realtime group priority
 - add fdinfo support
 
 etnaviv:
 - improve handling of DMA address limits
 - improve GPU hangcheck
 
 exynos:
 - Decon Exynos7870 support
 
 mediatek:
 - add OF graph support
 
 omap:
 - locking fixes
 
 bochs:
 - convert to gem/shmem from simpledrm
 
 v3d:
 - support big/super pages
 - add gemfs
 
 vc4:
 - BCM2712 support refactoring
 - add YUV444 format support
 
 udmabuf:
 - folio related fixes
 
 nouveau:
 - add panic support on nv50+
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmc+efwACgkQDHTzWXnE
 hr6Dyg/9HVVI3lxuWAz9MEt3w+BON5KTJAxg5Zhvc5DwiUbDXghu8sfkUfanDWS5
 /MqyPqLt5srXrtKTRDnzEI0Vf8YHeiDEcaydjpshEpCfteHZ7SADpvem8fp6/otV
 iYt8U6tMcGe9I+M2kwDkOTrKJIiyCKPi5hfBIAkxEAh6806ifPRtLkeMGbaSwBxH
 x6kZTE9ygGWAY7bAgbmVmm3JwrXG9mYDl9dW3cbi9gZ6PGAXHPZRUPvZoHhvfC2A
 UVgROH76Spm4rdWYGI3azj+gW3HsdGgUHcysb+lu37i261E+sT7kuV2UYtnOMzr5
 igO1RlQ+rcfPYLG4n+oNXDMu5d1OQXELrlQzXptym4Konpd7b/GSeVctWV0wHWuv
 nG8g7DWAFFnLAdeWqLZpf1Brze33h5+572D3BioWB4LYSEATjwoTwcBKsdRuc4Wk
 RHxjumCidybTdo/8EB1ElGlH39m/mDQA0scMlVhS/BuiIssfgcBRfltI8S3HzHcW
 YQYq6xH7F9E3shs3/TYbWR4clm66ZTnZV6ClDfGJolzyF/hbV0rsbeSpDelpooE8
 1Js7KuwVa+HvA4jtupY9vqxMTdXWwoGPfuUgKpOAreYibnd1T9Q1zVme/B1bUH05
 518IjiMGCxDnBvFWaPT9DcX4zg7pS3yzjw3hGkdz3reUqat0Gy8=
 =8cUI
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2024-11-21' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "There's a lot of rework, the panic helper support is being added to
  more drivers, v3d gets support for HW superpages, scheduler
  documentation, drm client and video aperture reworks, some new
  MAINTAINERS added, amdgpu has the usual lots of IP refactors, Intel
  has some Pantherlake enablement and xe is getting some SRIOV bits, but
  just lots of stuff everywhere.

  core:
   - split DSC helpers from DP helpers
   - clang build fixes for drm/mm test
   - drop simple pipeline support for gem vram
   - document submission error signaling
   - move drm_rect to drm core module from kms helper
   - add default client setup to most drivers
   - move to video aperture helpers instead of drm ones

  tests:
   - new framebuffer tests

  ttm:
   - remove swapped and pinned BOs from TTM lru

  panic:
   - fix uninit spinlock
   - add ABGR2101010 support

  bridge:
   - add TI TDP158 support
   - use standard PM OPS

  dma-fence:
   - use read_trylock instead of read_lock to help lockdep

  scheduler:
   - add errno to sched start to report different errors
   - add locking to drm_sched_entity_modify_sched
   - improve documentation

  xe:
   - add drm_line_printer
   - lots of refactoring
   - Enable Xe2 + PES disaggregation
   - add new ARL PCI ID
   - SRIOV development work
   - fix exec unnecessary implicit fence
   - define and parse OA sync props
   - forcewake refactoring

  i915:
   - Enable BMG/LNL ultra joiner
   - Enable 10bpx + CCS scanout on ICL+, fp16/CCS on TGL+
   - use DSB for plane/color mgmt
   - Arrow lake PCI IDs
   - lots of i915/xe display refactoring
   - enable PXP GuC autoteardown
   - Pantherlake (PTL) Xe3 LPD display enablement
   - Allow fastset HDR infoframe changes
   - write DP source OUI for non-eDP sinks
   - share PCI IDs between i915 and xe

  amdgpu:
   - SDMA queue reset support
   - SMU 13.0.6, JPEG 4.0.3 updates
   - Initial runtime repartitioning support
   - rework IP structs for multiple IP instances
   - Fetch EDID from _DDC if available
   - SMU13 zero rpm user control
   - lots of fixes/cleanups

  amdkfd:
   - Increase event FIFO size
   - add topology cap flag for per queue reset

  msm:
   - DPU:
      - SA8775P support
      - (disabled by default) MSM8917, MSM8937, MSM8953 and MSM8996 support
      - Enable large framebuffer support
      - Drop MSM8998 and SDM845
   - DP:
      - SA8775P support
   - GPU:
      - a7xx preemption support
      - Adreno A663 support

  ast:
   - warn about unsupported TX chips

  ivpu:
   - add coredump
   - add pantherlake support

  rockchip:
   - 4K@60Hz display enablement
   - generate pll programming tables

  panthor:
   - add timestamp query API
   - add realtime group priority
   - add fdinfo support

  etnaviv:
   - improve handling of DMA address limits
   - improve GPU hangcheck

  exynos:
   - Decon Exynos7870 support

  mediatek:
   - add OF graph support

  omap:
   - locking fixes

  bochs:
   - convert to gem/shmem from simpledrm

  v3d:
   - support big/super pages
   - add gemfs

  vc4:
   - BCM2712 support refactoring
   - add YUV444 format support

  udmabuf:
   - folio related fixes

  nouveau:
   - add panic support on nv50+"

* tag 'drm-next-2024-11-21' of https://gitlab.freedesktop.org/drm/kernel: (1583 commits)
  drm/xe/guc: Fix dereference before NULL check
  drm/amd: Fix initialization mistake for NBIO 7.7.0
  Revert "drm/amd/display: parse umc_info or vram_info based on ASIC"
  drm/amd/display: Fix failure to read vram info due to static BP_RESULT
  drm/amdgpu: enable GTT fallback handling for dGPUs only
  drm/amd/amdgpu: limit single process inside MES
  drm/fourcc: add AMD_FMT_MOD_TILE_GFX9_4K_D_X
  drm/amdgpu/mes12: correct kiq unmap latency
  drm/amdgpu: Support vcn and jpeg error info parsing
  drm/amd : Update MES API header file for v11 & v12
  drm/amd/amdkfd: add/remove kfd queues on start/stop KFD scheduling
  drm/amdkfd: change kfd process kref count at creation
  drm/amdgpu: Cleanup shift coding style
  drm/amd/amdgpu: Increase MES log buffer to dump mes scratch data
  drm/amdgpu: Implement virt req_ras_err_count
  drm/amdgpu: VF Query RAS Caps from Host if supported
  drm/amdgpu: Add msg handlers for SRIOV RAS Telemetry
  drm/amdgpu: Update SRIOV Exchange Headers for RAS Telemetry Support
  drm/amd/display: 3.2.309
  drm/amd/display: Adjust VSDB parser for replay feature
  ...
2024-11-21 14:56:17 -08:00
Dave Airlie
b6ad7debf5 nouveau: handle EBUSY and EAGAIN for GSP aux errors.
The upper layer transfer functions expect EBUSY as a return
for when retries should be done.

Fix the AUX error translation, but also check for both errors
in a few places.

Fixes: eb284f4b37 ("drm/nouveau/dp: Honor GSP link training retry timeouts")
Cc: stable@vger.kernel.org
Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241111034126.2028401-1-airlied@gmail.com
2024-11-14 11:50:01 +10:00
Benjamin Szőke
231bb9b4c4 drm/nouveau/i2c: rename aux.c and aux.h to auxch.c and auxch.h
The goal is to clean-up Linux repository from AUX file names, because
the use of such file names is prohibited on other operating systems
such as Windows, so the Linux repository cannot be cloned and
edited on them.

Signed-off-by: Benjamin Szőke <egyszeregy@freemail.hu>
Reviewed-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240603091558.35672-1-egyszeregy@freemail.hu
2024-10-03 20:05:16 +02:00
Thomas Zimmermann
2dd0ef5d95 Merge drm/drm-next into drm-misc-next
Get drm-misc-next to up v6.12-rc1.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-09-30 10:50:54 +02:00
Thomas Zimmermann
61b86391fb Merge drm/drm-next into drm-misc-next
Backmerging to get fixes from v6.12-rc7.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-09-11 09:48:49 +02:00
Ben Skeggs
6db9df4f70 drm/nouveau/fb: restore init() for ramgp102
init() was removed from ramgp102 when reworking the memory detection, as
it was thought that the code was only necessary when the driver performs
mclk changes, which nouveau doesn't support on pascal.

However, it turns out that we still need to execute this on some GPUs to
restore settings after DEVINIT, so revert to the original behaviour.

v2: fix tags in commit message, cc stable

Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/319
Fixes: 2c0c15a22f ("drm/nouveau/fb/gp102-ga100: switch to simpler vram size detection method")
Cc: stable@vger.kernel.org # 6.6+
Signed-off-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240904232418.8590-1-bskeggs@nvidia.com
2024-09-10 12:22:48 +02:00
Li Zetao
c6430a8eb0 drm/nouveau/volt: use clamp() in nvkm_volt_map()
When it needs to get a value within a certain interval, using clamp()
makes the code easier to understand than min(max()).

Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240831012803.3950100-4-lizetao1@huawei.com
2024-09-04 15:21:43 -04:00
Dave Airlie
f33b9ab049 nouveau: fix the fwsec sb verification register.
This aligns with what open gpu does, the 0x15 hex is just to trick you.

Fixes: 176fdcbddf ("drm/nouveau/gsp/r535: add support for booting GSP-RM")
Reviewed-by: Ben Skeggs <bskeggs@nvidia.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20240828023720.1596602-1-airlied@gmail.com
2024-08-31 00:23:45 +02:00
Maxime Ripard
375c4d1583
Merge drm/drm-next into drm-misc-next
Let's start the new release cycle.

Signed-off-by: Maxime Ripard <mripard@kernel.org>
2024-05-27 11:08:31 +02:00
Dave Airlie
09e10499ee Short summary of fixes pull:
imagination:
 - fix page-count macro
 
 nouveau:
 - avoid page-table allocation failures
 - fix firmware memory allocation
 
 panel:
 - ili9341: avoid OF for device properties; respect deferred probe; fix
 usage of errno codes
 
 ttm:
 - fix status output
 
 vmwgfx:
 - fix legacy display unit
 - fix read length in fence signalling
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmYz53QACgkQaA3BHVML
 eiP7Ggf/USKiwX3xhWFk3SLGCWYM+eflhPgyyTs4Ue8NYOd/0C7lDKtEWwSbPUlB
 6Tg8FZW0OF0eLYQjl7/A0sufj/225mR8jKCmwbBg8KherqPfThxChRqh7kjcsaHR
 pG8yleio7oPGg4qB4hi/PENB5kVFw6s1tVPOz/X8W7V2aEVglO+9IGup6CB3TQKu
 9kGHbgPVZ1SoDlXrmUmApfAy+Qqu4F8s3m8b6hYXvit3SqJl275aJIqL4fjw5Ufk
 DhbTfdL4ao28gOAIdRLuOlwpadMOl+rMTls8Q/FwW0Pk/S8Osu1QykhlhHBuBxzb
 bs3YmYwsMKeKofvLHwxrWqQVfH+TSA==
 =h9Cb
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-fixes-2024-05-02' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

Short summary of fixes pull:

imagination:
- fix page-count macro

nouveau:
- avoid page-table allocation failures
- fix firmware memory allocation

panel:
- ili9341: avoid OF for device properties; respect deferred probe; fix
usage of errno codes

ttm:
- fix status output

vmwgfx:
- fix legacy display unit
- fix read length in fence signalling

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20240502192117.GA12158@linux.fritz.box
2024-05-03 11:16:40 +10:00
Lyude Paul
6f572a8054 drm/nouveau/gsp: Use the sg allocator for level 2 of radix3
Currently we allocate all 3 levels of radix3 page tables using
nvkm_gsp_mem_ctor(), which uses dma_alloc_coherent() for allocating all of
the relevant memory. This can end up failing in scenarios where the system
has very high memory fragmentation, and we can't find enough contiguous
memory to allocate level 2 of the page table.

Currently, this can result in runtime PM issues on systems where memory
fragmentation is high - as we'll fail to allocate the page table for our
suspend/resume buffer:

  kworker/10:2: page allocation failure: order:7, mode:0xcc0(GFP_KERNEL),
  nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 10 PID: 479809 Comm: kworker/10:2 Not tainted
  6.8.6-201.ChopperV6.fc39.x86_64 #1
  Hardware name: SLIMBOOK Executive/Executive, BIOS N.1.10GRU06 02/02/2024
  Workqueue: pm pm_runtime_work
  Call Trace:
   <TASK>
   dump_stack_lvl+0x64/0x80
   warn_alloc+0x165/0x1e0
   ? __alloc_pages_direct_compact+0xb3/0x2b0
   __alloc_pages_slowpath.constprop.0+0xd7d/0xde0
   __alloc_pages+0x32d/0x350
   __dma_direct_alloc_pages.isra.0+0x16a/0x2b0
   dma_direct_alloc+0x70/0x270
   nvkm_gsp_radix3_sg+0x5e/0x130 [nouveau]
   r535_gsp_fini+0x1d4/0x350 [nouveau]
   nvkm_subdev_fini+0x67/0x150 [nouveau]
   nvkm_device_fini+0x95/0x1e0 [nouveau]
   nvkm_udevice_fini+0x53/0x70 [nouveau]
   nvkm_object_fini+0xb9/0x240 [nouveau]
   nvkm_object_fini+0x75/0x240 [nouveau]
   nouveau_do_suspend+0xf5/0x280 [nouveau]
   nouveau_pmops_runtime_suspend+0x3e/0xb0 [nouveau]
   pci_pm_runtime_suspend+0x67/0x1e0
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   __rpm_callback+0x41/0x170
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   rpm_callback+0x5d/0x70
   ? __pfx_pci_pm_runtime_suspend+0x10/0x10
   rpm_suspend+0x120/0x6a0
   pm_runtime_work+0x98/0xb0
   process_one_work+0x171/0x340
   worker_thread+0x27b/0x3a0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xe5/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x31/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1b/0x30

Luckily, we don't actually need to allocate coherent memory for the page
table thanks to being able to pass the GPU a radix3 page table for
suspend/resume data. So, let's rewrite nvkm_gsp_radix3_sg() to use the sg
allocator for level 2. We continue using coherent allocations for lvl0 and
1, since they only take a single page.

V2:
* Don't forget to actually jump to the next scatterlist when we reach the
  end of the scatterlist we're currently on when writing out the page table
  for level 2

Signed-off-by: Lyude Paul <lyude@redhat.com>
Cc: stable@vger.kernel.org
Reviewed-by: Ben Skeggs <bskeggs@nvidia.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240429182318.189668-2-lyude@redhat.com
2024-04-30 12:45:42 -04:00
Chaitanya Kumar Borah
4a9a567ab1 nouveau: Add missing break statement
Add the missing break statement that causes the following build error

	  CC [M]  drivers/gpu/drm/i915/display/intel_display_device.o
	../drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c: In function ‘build_registry’:
	../drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:1266:3: error: label at end of compound statement
	1266 |   default:
	      |   ^~~~~~~
	  CC [M]  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.o
	  HDRTEST drivers/gpu/drm/xe/compat-i915-headers/i915_reg.h
	  CC [M]  drivers/gpu/drm/amd/amdgpu/imu_v11_0.o
	make[7]: *** [../scripts/Makefile.build:244: drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.o] Error 1
	make[7]: *** Waiting for unfinished jobs....

Fixes: b58a0bc904 ("nouveau: add command-line GSP-RM registry support")
Closes: https://lore.kernel.org/all/913052ca6c0988db1bab293cfae38529251b4594.camel@nvidia.com/T/#m3c9acebac754f2e74a85b76c858c093bb1aacaf0
Closes: https://lore.kernel.org/all/CA+G9fYu7Ug0K8h9QJT0WbtWh_LL9Juc+VC0WMU_Z_vSSPDNymg@mail.gmail.com/
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240430131840.742924-1-chaitanya.kumar.borah@intel.com
2024-04-30 16:06:18 +02:00
Timur Tabi
b58a0bc904 nouveau: add command-line GSP-RM registry support
Add the NVreg_RegistryDwords command line parameter, which allows
specifying additional registry keys to be sent to GSP-RM.  This
allows additional configuration, debugging, and experimentation
with GSP-RM, which uses these keys to alter its behavior.

Note that these keys are passed as-is to GSP-RM, and Nouveau does
not parse them.  This is in contrast to the Nvidia driver, which may
parse some of the keys to configure some functionality in concert with
GSP-RM.  Therefore, any keys which also require action by the driver
may not function correctly when passed by Nouveau.  Caveat emptor.

The name and format of NVreg_RegistryDwords is the same as used by
the Nvidia driver, to maintain compatibility.

Signed-off-by: Timur Tabi <ttabi@nvidia.com>
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240417215317.3490856-1-ttabi@nvidia.com
2024-04-26 18:06:39 +02:00