Bug 1215367 - [Build D.65.3] journal_check: pam_wtmpdb: database locked
Summary: [Build D.65.3] journal_check: pam_wtmpdb: database locked
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Basesystem (show other bugs)
Version: Current
Hardware: Other Other
: P5 - None : Normal (vote)
Target Milestone: ---
Assignee: Stefan Schubert
QA Contact: E-mail List
URL: https://openqa.opensuse.org/tests/357...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-15 07:21 UTC by Dominique Leuenberger
Modified: 2023-10-25 08:08 UTC (History)
2 users (show)

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Dominique Leuenberger 2023-09-15 07:21:52 UTC
## Observation

We see errors like this popping up in the journal:

pam_wtmpdb(login:session): Failed to execute statement: database is locked

openQA test in scenario microos-Staging:D-Staging-MicroOS-Image-ContainerHost-x86_64-microos@64bit fails in
[journal_check](https://openqa.opensuse.org/tests/3578202/modules/journal_check/steps/32)

## Test suite description



## Reproducible

Fails since (at least) Build [D.65.3](https://openqa.opensuse.org/tests/3576202)


## Expected result

Last good: [D.65.2](https://openqa.opensuse.org/tests/3575699) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=microos&flavor=Staging-MicroOS-Image-ContainerHost&machine=64bit&test=microos&version=Staging%3AD)
Comment 1 Thorsten Kukuk 2023-09-15 08:09:27 UTC
The errors are coming from sqlite and my suspicion is parallel access to wtmpdb.
Which would be the same problem as utmp/wtmp have, which is not surprising at all:
https://sourceware.org/bugzilla/show_bug.cgi?id=24492

Maybe we can make the timeout to get the lock for sqlite longer? Or repeat to get the lock a few times? But in every case, we should not block because of this. An error in the log is much better than that a normal user can block root or other users or processes from being able to login.

And we should extend the error message, so that we know in which phase this happens, if possible.
Comment 2 Dominique Leuenberger 2023-09-15 10:45:43 UTC
This also begs the question then if it should be logged as error, warning, or information in the journal
Comment 3 Stefan Schubert 2023-09-15 10:52:38 UTC
Hm, I have tried an installation with this ISO but:
journalctl --no-pager --quiet -p err -o short-precise
does not return this error. 
Maybe it is a timing issue. Anyway I will have a look to the source
code and will follow Thorsten's suggestions....
Comment 4 Stefan Schubert 2023-09-15 10:58:37 UTC
It seems that the login has gone successfully, right ?

I am wondering where 

Sep 15 06:30:49.316433 localhost.localdomain systemctl[1568]: systemctl: invalid option -- '.'

does come from so often and why

Sep 15 06:54:00.970453 localhost.localdomain root[942]: ERROR: "/usr/libexec/health-checker/fail.sh check" failed
Sep 15 06:54:01.168261 localhost.localdomain root[989]: Machine didn't come up correctly, do a rollback
Comment 5 Stefan Schubert 2023-09-15 10:59:11 UTC
Added Thorsten to CC
Comment 6 Thorsten Kukuk 2023-09-15 11:56:55 UTC
(In reply to Stefan Schubert from comment #3)
> Hm, I have tried an installation with this ISO but:
> journalctl --no-pager --quiet -p err -o short-precise
> does not return this error. 
> Maybe it is a timing issue. Anyway I will have a look to the source
> code and will follow Thorsten's suggestions....

The error can only happen, if either:
1. the wtmp.db is broken, because a process writing to it got killed. I don't think that's the case here.
2. the wtmp.db file is broken for other reasons
3. two or more processes try to write to the database at the same time.

I think it's 3.

What you could try:
write a small C program which opens wtmp.db in read-write mode, sleeps 30 seconds (or whatever you need) and closes it.

During the sleep time, you should try to login on the console.
I expect:
1. login on console should work
2. you will see the error message in the journal
Comment 7 Thorsten Kukuk 2023-09-15 11:58:16 UTC
(In reply to Stefan Schubert from comment #4)
> It seems that the login has gone successfully, right ?
> 
> I am wondering where 
> 
> Sep 15 06:30:49.316433 localhost.localdomain systemctl[1568]: systemctl:
> invalid option -- '.'
> 
> does come from so often 

I think Fabian fixed that now.

> and why
> 
> Sep 15 06:54:00.970453 localhost.localdomain root[942]: ERROR:
> "/usr/libexec/health-checker/fail.sh check" failed
> Sep 15 06:54:01.168261 localhost.localdomain root[989]: Machine didn't come
> up correctly, do a rollback

This looks like a QA test case, let health-checker fail and look, if we make a rollback.
Comment 8 Thorsten Kukuk 2023-09-15 12:01:02 UTC
(In reply to Dominique Leuenberger from comment #2)
> This also begs the question then if it should be logged as error, warning,
> or information in the journal

We don't know the reason why the wtmp.db is locked (could also be another problem) and wtmp.db has afterwards either incomplete entries or some are missing. So I would stay with the error.
Comment 9 Stefan Schubert 2023-09-15 14:25:41 UTC
(In reply to Thorsten Kukuk from comment #6)
> (In reply to Stefan Schubert from comment #3)
> > Hm, I have tried an installation with this ISO but:
> > journalctl --no-pager --quiet -p err -o short-precise
> > does not return this error. 
> > Maybe it is a timing issue. Anyway I will have a look to the source
> > code and will follow Thorsten's suggestions....
> 
> The error can only happen, if either:
> 1. the wtmp.db is broken, because a process writing to it got killed. I
> don't think that's the case here.
> 2. the wtmp.db file is broken for other reasons
> 3. two or more processes try to write to the database at the same time.
> 
> I think it's 3.
> 
> What you could try:
> write a small C program which opens wtmp.db in read-write mode, sleeps 30
> seconds (or whatever you need) and closes it.
> 
> During the sleep time, you should try to login on the console.
> I expect:
> 1. login on console should work
> 2. you will see the error message in the journal

I have stopped a process after sqlite3_open_v2 (rw open) and have been able to login on another terminal without an error.

I will check the code regarding these possibilities:
https://www2.sqlite.org/cvstrac/wiki?p=DatabaseIsLocked 

espl. around the db rotate call......
Comment 10 Stefan Schubert 2023-09-18 14:14:14 UTC
I cannot reproduce the error in all cases.
I have added a timeout of 5 seconds now and have increased the logging in order to see where it happens:
https://github.com/thkukuk/wtmpdb/pull/9
I will take that bug in order to see the results.....
Comment 11 Stefan Schubert 2023-10-25 08:08:53 UTC
As far I am remembering the bug has gone some time ago with this fix.
Dominique, please reopen it if there is still a problem.