Bug 1179925

Summary: Suspend to disk is broken on Thinkpad T495 vega 10 gpu
Product: [openSUSE] openSUSE Distribution
Reporter: Ali Abdallah <ali.abdallah>
Component: Basesystem
Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: ali.abdallah, tiwai
Version: Leap 15.2
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: dmesg failed failed resume from hibernate
dmesg failed resume from hibernate kernel 5.3.18-lp152.57
dmesg successful resume from hibernate kernel 5.4.0
dmesg failed resume from hibernate kernel 5.3.18-lp152.2.g53cb342
hwinfo

Description Ali Abdallah 2020-12-11 08:11:41 UTC
Created attachment 844380 [details]
dmesg failed failed resume from hibernate

Upon resume from hibernate on a Thinkpad T495 (AMD Ryzen 7 PRO 3700U), the X server is completely frozen. The laptop itself is not frozen, as I can still connect to it normally over ssh. I believe the issue is in the amdgpu drm driver.

Kernel version 5.3.18-lp152.57-default

I've tested kernel 5.10.rc7-2.1.g9688120 from head and kernel 5.9.13-1.1.g3dfd18b from stable, suspend to disk works fine with both kernels.

Moreover, I've compiled the upstream LTS kernel closest to the Leap kernel (v5.4.82), and it works perfectly fine. I was hoping to diff the changes to identify the fix, but with no success so far.

I'm attaching the dmesg output captured after the machine resumed from hibernate; it shows the following amdgpu-related errors:

[   28.548521] amdgpu 0000:06:00.0: GPU mode1 reset failed
[   28.548709] [drm:amdgpu_device_suspend [amdgpu]] *ERROR* amdgpu asic reset failed
[   29.216975] amdgpu 0000:06:00.0: [gfxhub] no-retry page fault (src_id:0 ring:222 vmid:1 pasid:0, for process  pid 0 thread  pid 0)
[   29.216979] amdgpu 0000:06:00.0:   in page starting at address 0x0000800000028000 from 27
[   29.216981] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x001009BC
[   29.227613] amdgpu: [powerplay] dpm has been enabled
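
For anyone comparing several such logs, the relevant lines can be pulled out mechanically. The following is only a minimal sketch: the function name `amdgpu_errors` and the keyword heuristics (`*ERROR*`, "failed", "fault") are assumptions chosen to match the excerpt above, not an official amdgpu log format.

```python
import re

# Matches the "[   28.548521] message" prefix that dmesg prints by default.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

# The excerpt quoted in this report, used here as sample input.
SAMPLE = """\
[   28.548521] amdgpu 0000:06:00.0: GPU mode1 reset failed
[   28.548709] [drm:amdgpu_device_suspend [amdgpu]] *ERROR* amdgpu asic reset failed
[   29.216975] amdgpu 0000:06:00.0: [gfxhub] no-retry page fault (src_id:0 ring:222 vmid:1 pasid:0, for process  pid 0 thread  pid 0)
[   29.216979] amdgpu 0000:06:00.0:   in page starting at address 0x0000800000028000 from 27
[   29.216981] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x001009BC
[   29.227613] amdgpu: [powerplay] dpm has been enabled
"""


def amdgpu_errors(dmesg_text):
    """Return (timestamp, message) pairs for amdgpu lines that look like errors.

    The keyword list is a heuristic (an assumption), so it may miss or
    over-report lines on other kernels.
    """
    hits = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        if "amdgpu" not in msg:
            continue
        lowered = msg.lower()
        if "*error*" in lowered or "failed" in lowered or "fault" in lowered:
            hits.append((float(match.group("ts")), msg))
    return hits


if __name__ == "__main__":
    for ts, msg in amdgpu_errors(SAMPLE):
        print(f"{ts:12.6f}  {msg}")
```

Running the same filter over the failing and the working dmesg attachments would make it easier to see where the two resume paths start to diverge.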
Comment 1 Takashi Iwai 2020-12-11 08:46:24 UTC
It's not that trivial, unfortunately.  Between 5.3 and 5.4, amdgpu alone received over 900 patches, and those are massive:
 443 files changed, 220503 insertions(+), 8490 deletions(-)

We may check further, but in general, the support for the recent AMD chipset on Leap 15.2 is pretty limited.  It should be greatly improved in Leap 15.3, though.
Comment 2 Takashi Iwai 2020-12-11 08:49:42 UTC
BTW, could you try 5.4.0 kernel, and see whether the hibernate resume works?
If it's fixed between 5.4.0 and 5.4.82, we may easily bisect.
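
The idea behind such a bisect can be sketched as a binary search over the stable point releases. This is only an illustration under stated assumptions: `first_good` and the `is_good` callback are hypothetical names, and the search assumes that once a release is good, every later release is good too. In practice one would run `git bisect` on the stable tree rather than test whole releases.

```python
from typing import Callable, Sequence


def first_good(versions: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_good() is true.

    Assumes versions are ordered oldest to newest, the last one is good,
    and goodness never flips back to bad in a later version.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[hi] is good
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            hi = mid          # the fix landed at mid or earlier
        else:
            lo = mid + 1      # the fix landed after mid
    return versions[lo]


# The 5.4.0 .. 5.4.82 range discussed in this comment.
VERSIONS = [f"5.4.{n}" for n in range(0, 83)]

if __name__ == "__main__":
    # Hypothetical stand-in for "boot this kernel and test hibernate/resume".
    def pretend_test(version: str) -> bool:
        return int(version.split(".")[2]) >= 37

    print(first_good(VERSIONS, pretend_test))
```

With 83 candidate releases this needs at most 7 test boots, which is why a fix that landed inside the 5.4.y series would be comparatively cheap to locate.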
Comment 3 Ali Abdallah 2020-12-11 15:51:07 UTC
(In reply to Takashi Iwai from comment #2)
> BTW, could you try 5.4.0 kernel, and see whether the hibernate resume works?
> If it's fixed between 5.4.0 and 5.4.82, we may easily bisect.

Just checked, hibernate/resume "works" on 5.4.1, but not in a reliable way, in the sense that it doesn't work 2/3 times in a row.
Comment 4 Takashi Iwai 2020-12-11 16:16:04 UTC
(In reply to Ali Abdallah from comment #3)
> (In reply to Takashi Iwai from comment #2)
> > BTW, could you try 5.4.0 kernel, and see whether the hibernate resume works?
> > If it's fixed between 5.4.0 and 5.4.82, we may easily bisect.
> 
> Just checked, hibernate/resume "works" on 5.4.1, but not in a reliable way,
> in the sense that it doesn't work 2/3 times in a row.

Do you mean 5.4.0 didn't work at all but 5.4.1 works in some level?  Or all 5.4 works more or less from the beginning?

And 5.4.82 works more reliably, or it also shows the same problem?
Comment 5 Ali Abdallah 2020-12-14 08:06:42 UTC
(In reply to Takashi Iwai from comment #4)
> Do you mean 5.4.0 didn't work at all but 5.4.1 works in some level?  Or all
> 5.4 works more or less from the beginning?
> 
> And 5.4.82 works more reliably, or it also shows the same problem?

Hibernate works with both 5.4.0 and 5.4.1, but unreliably.

On 5.3.18-lp152.57-default suspend works but hibernate doesn't work at all.

I'm running 5.4.82 and haven't had any problem so far: hibernate and suspend to RAM always work perfectly fine with this kernel version.
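
Because 5.4.0 and 5.4.1 resume only some of the time, a single successful cycle says little about whether a kernel is really good. Below is a rough sketch of how many consecutive clean hibernate/resume cycles one might want before trusting a kernel. It assumes each cycle fails independently with a fixed probability; the function name `cycles_needed` and the failure rate of 1/3 are hypothetical and not measured in this report.

```python
import math


def cycles_needed(fail_rate: float, alpha: float = 0.05) -> int:
    """Smallest n such that a kernel failing with probability `fail_rate`
    per cycle would survive n cycles in a row with probability <= alpha.

    Both arguments must lie strictly between 0 and 1.
    """
    if not 0.0 < fail_rate < 1.0:
        raise ValueError("fail_rate must be between 0 and 1 (exclusive)")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be between 0 and 1 (exclusive)")
    # (1 - fail_rate) ** n <= alpha  <=>  n >= log(alpha) / log(1 - fail_rate)
    return math.ceil(math.log(alpha) / math.log(1.0 - fail_rate))


if __name__ == "__main__":
    print(cycles_needed(1 / 3))
```

With the hypothetical rate of 1/3 this gives 8 cycles, so any bisect over intermittently failing kernels needs several test cycles per step rather than one.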
Comment 6 Takashi Iwai 2020-12-14 08:40:40 UTC
OK, thanks for clarification.

Could you give the dmesg outputs with drm.debug=0x0e boot options in both 5.3 and 5.4 kernels while hibernate/resume?
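
For reference, `drm.debug` is a bitmask of debug message categories. The sketch below decodes the requested value; the bit assignments are taken, to the best of my knowledge, from the kernel's `drm_print.h` of that era and should be checked against the kernel actually in use. The function name `decode_drm_debug` is just an illustrative helper.

```python
# Bit values as defined for the drm.debug module parameter
# (assumed from drm_print.h; verify against the running kernel).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask: int) -> list:
    """Return the names of the debug categories enabled by `mask`."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    print(decode_drm_debug(0x0E))
```

Under that mapping, 0x0e enables the DRIVER, KMS and PRIME categories while leaving the much noisier CORE messages off.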
Comment 7 Ali Abdallah 2020-12-14 08:54:04 UTC
Created attachment 844419 [details]
dmesg failed resume from hibernate kernel 5.3.18-lp152.57
Comment 8 Ali Abdallah 2020-12-14 08:54:43 UTC
Created attachment 844420 [details]
dmesg successful resume from hibernate kernel 5.4.0
Comment 9 Takashi Iwai 2020-12-14 13:56:16 UTC
Thanks.  So the steps before the GPU reset seem mostly equivalent, and the difference appears after GPU mode1 reset -- maybe the changes in that part really play some role, as 5.4 seems to have changed the GPU reset method depending on the condition.

I tried to backport some relevant patches from 5.4, and the test kernel is being built in OBS home:tiwai:bsc1179925 repo.  It takes some time (an hour or so) until the build finishes.  Please give it a try later.
Comment 10 Ali Abdallah 2020-12-14 15:57:01 UTC
Created attachment 844441 [details]
dmesg failed resume from hibernate kernel 5.3.18-lp152.2.g53cb342

I gave it a try, but unfortunately hibernate is still broken.
Comment 11 Takashi Iwai 2020-12-14 16:43:26 UTC
Then something in powerplay stuff, maybe.

Meanwhile, could you give hwinfo output?  I guess your chip isn't properly supported by the 5.3.x base anyway, but it's better to improve something, of course.
Comment 12 Ali Abdallah 2020-12-14 17:03:06 UTC
Created attachment 844446 [details]
hwinfo
Comment 13 Takashi Iwai 2020-12-14 17:07:18 UTC
Thanks.  The kernel is being rebuilt with two more backport patches in the same OBS repo.  Let's see whether it works better.
Comment 14 Ali Abdallah 2020-12-14 20:00:23 UTC
(In reply to Takashi Iwai from comment #13)
> Thanks.  The kernel is being rebuilt with two more backport patches in the
> same OBS repo.  Let's see whether it works better.

Same result with 5.3.18-lp152.3.1.gb4566a6.
Comment 15 Takashi Iwai 2020-12-15 07:08:15 UTC
OK, that's what I was afraid of; for fixing this, we'd need backport tons of patches, as this is rather a new GPU model (Raven2 or such), and another GPU reset model is needed that needs another infrastructure in amdgpu, etc, etc.
Comment 16 Ali Abdallah 2020-12-15 13:12:56 UTC
(In reply to Takashi Iwai from comment #15)
> OK, that's what I was afraid of; for fixing this, we'd need backport tons of
> patches, as this is rather a new GPU model (Raven2 or such), and another GPU
> reset model is needed that needs another infrastructure in amdgpu, etc, etc.

The laptop is not really that recent; it was released in May 2019, and it works fine with 5.4.62 (the lowest version I've tried in the 5.4.x series that gave me no issues at all with the GPU).

In an ideal world, one could just drop in the new amdgpu drm code that contains support for new GPUs and important fixes for older ones, compile it and use it, but the constant KPI changes in the drm subsystem make that hard or impossible.

I do also perfectly understand the issue with the amdgpu code, it is improving, but fixes are mixed with new features, which makes the backporting work hard. 

BTW I had other artifacts/issues related to the GPU with version 5.3.18-lp152.57-default, so I would just say that my GPU is not well supported, and if you agree, I would close this as a feature request.
Comment 17 Takashi Iwai 2020-12-15 13:26:00 UTC
Heh, it's about the relativity of the oldness, 5.3 is much older :)

So indeed, this is likely a WONTFIX issue for Leap 15.2 officially.  Leap 15.3 will receive the full backport up to 5.9.x equivalent or later, and it should work as is.  You can try the kernel in OBS Kernel:SLE15-SP3 repo if you're interested in.

OTOH, I'm considering to provide some alternative way for the temporary update via KMP or such.  But this won't be an official update and likely provided via OBS home:tiwai:* project (and I can't promise when it'll be available).
Comment 18 Ali Abdallah 2020-12-15 13:45:02 UTC
(In reply to Takashi Iwai from comment #17)
> Heh, it's about the relativity of the oldness, 5.3 is much older :)

5.3 and 5.4, at least as numbers, seem too close to me to have incompatible drm subsystems. But anyway that is an upstream thing...

> 
> So indeed, this is likely a WONTFIX issue for Leap 15.2 officially.  Leap
> 15.3 will receive the full backport up to 5.9.x equivalent or later, and it
> should work as is.  You can try the kernel in OBS Kernel:SLE15-SP3 repo if
> you're interested in.

I'm currently running v5.4.82 without issues, but will give Kernel:SLE15-SP3 a try next time I have to reboot the machine. Please feel free to close this at your earliest convenience.

> OTOH, I'm considering to provide some alternative way for the temporary
> update via KMP or such.  But this won't be an official update and likely
> provided via OBS home:tiwai:* project (and I can't promise when it'll be
> available).

That would be nice. I mean, since backporting is not always an option, it would be very nice to have an alternative that is easy to install, for example the latest upstream LTS kernel ready to be used from the official repository or some add-on repository.
Comment 19 Takashi Iwai 2021-02-03 15:18:24 UTC
I recently backported a few amdgpu fixes on Leap 15.2 that might be relevant with the suspend/resume.  Let's hope that.

Providing a temporary KMP isn't proceeded yet due to lack of time.  Maybe trying Leap 15.3 kernel can be a better choice.
Comment 20 Ali Abdallah 2021-02-05 07:07:56 UTC
(In reply to Takashi Iwai from comment #19)
> I recently backported a few amdgpu fixes on Leap 15.2 that might be relevant
> with the suspend/resume.  Let's hope that.

Thanks, will give a try.

> Providing a temporary KMP isn't proceeded yet due to lack of time.  Maybe
> trying Leap 15.3 kernel can be a better choice.

No problem, I'm currently running 5.10.12 on Leap 15.2 without any issue.
Comment 21 Ali Abdallah 2021-02-26 08:03:15 UTC
(In reply to Takashi Iwai from comment #19)
> I recently backported a few amdgpu fixes on Leap 15.2 that might be relevant
> with the suspend/resume.  Let's hope that.

I've tested today with kernel 5.3.18-lp152.63.1, the system froze before even entering S4 state.
Comment 22 Ali Abdallah 2021-10-28 18:32:51 UTC
Closing as WONTFIX. I'm now using openSUSE Leap 15.3, and hibernate works fine on my T495.