Bug 904837 - systemd-journald - Program terminated with signal SIGBUS, Bus error.
systemd-journald - Program terminated with signal SIGBUS, Bus error.
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Basesystem
201411*
x86-64 openSUSE 13.2
: P5 - None : Normal (vote)
: ---
Assigned To: systemd maintainers
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-11-11 09:52 UTC by Martin Pluskal
Modified: 2015-04-09 09:00 UTC (History)
5 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
core (580.00 KB, application/octet-stream)
2014-11-11 09:52 UTC, Martin Pluskal
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Martin Pluskal 2014-11-11 09:52:00 UTC
Created attachment 613168 [details]
core

I have noticed that systemd-journald has crashed when there was not enough space on root device:
Core was generated by `/usr/lib/systemd/systemd-journald'.
Program terminated with signal SIGBUS, Bus error.
#0  0x0000000000426d5b in sd_id128_get_machine (ret=0x7f2c8f0ec028) at src/libsystemd/sd-id128/sd-id128.c:118
118     src/libsystemd/sd-id128/sd-id128.c: No such file or directory.
(gdb) bt
#0  0x0000000000426d5b in sd_id128_get_machine (ret=0x7f2c8f0ec028) at src/libsystemd/sd-id128/sd-id128.c:118
#1  0x00000000004157f7 in journal_file_refresh_header (f=0x9212e0) at src/journal/journal-file.c:205
#2  journal_file_open (fname=fname@entry=0x7fff634d72a0 "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal", flags=flags@entry=66, mode=mode@entry=416, compress=compress@entry=true, seal=seal@entry=true, 
    metrics=metrics@entry=0x7fff634d75d8, mmap_cache=mmap_cache@entry=0x920250, template=template@entry=0x0, ret=ret@entry=0x7fff634d7568) at src/journal/journal-file.c:2585
#3  0x00000000004163fc in journal_file_open_reliably (fname=fname@entry=0x7fff634d72a0 "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal", flags=flags@entry=66, mode=mode@entry=416, compress=true, seal=true, 
    metrics=metrics@entry=0x7fff634d75d8, mmap_cache=0x920250, template=template@entry=0x0, ret=ret@entry=0x7fff634d7568) at src/journal/journal-file.c:2683
#4  0x000000000040ce68 in system_journal_open (s=s@entry=0x7fff634d74f0) at src/journal/journald-server.c:984
#5  0x000000000040dee4 in server_init (s=s@entry=0x7fff634d74f0) at src/journal/journald-server.c:1628
#6  0x0000000000406468 in main (argc=<optimized out>, argv=<optimized out>) at src/journal/journald.c:57
(gdb) disassemble $pc-32,$pc+32
Dump of assembler code from 0x426d3b to 0x426d7b:
   0x0000000000426d3b <sd_id128_get_machine+43>:        mov    %fs:0x0,%rax
   0x0000000000426d44 <sd_id128_get_machine+52>:        cmpb   $0x0,-0x28(%rax)
   0x0000000000426d4b <sd_id128_get_machine+59>:        je     0x426d90 <sd_id128_get_machine+128>
   0x0000000000426d4d <sd_id128_get_machine+61>:        mov    -0x30(%rax),%rdx
   0x0000000000426d54 <sd_id128_get_machine+68>:        mov    -0x38(%rax),%rax
=> 0x0000000000426d5b <sd_id128_get_machine+75>:        mov    %rdx,0x8(%rbp)
   0x0000000000426d5f <sd_id128_get_machine+79>:        mov    %rax,0x0(%rbp)
   0x0000000000426d63 <sd_id128_get_machine+83>:        xor    %eax,%eax
   0x0000000000426d65 <sd_id128_get_machine+85>:        mov    0x48(%rsp),%rcx
   0x0000000000426d6a <sd_id128_get_machine+90>:        xor    %fs:0x28,%rcx
   0x0000000000426d73 <sd_id128_get_machine+99>:        jne    0x426e97 <sd_id128_get_machine+391>
   0x0000000000426d79 <sd_id128_get_machine+105>:       add    $0x50,%rsp
End of assembler dump.
(gdb) p $pc
$1 = (void (*)()) 0x426d5b <sd_id128_get_machine+75>
Comment 1 Dr. Werner Fink 2014-11-11 10:28:51 UTC
What does this mean?  Show us the result of

          df

how you have filled up your root device?
Comment 2 Dr. Werner Fink 2014-11-11 10:46:29 UTC
Hmmm ... after reading sd_id128_get_machine() from src/libsystemd/sd-id128/sd-id128.c this becomes strange ...

 _public_ int sd_id128_get_machine(sd_id128_t *ret) {
         static thread_local sd_id128_t saved_machine_id;
         static thread_local bool saved_machine_id_valid = false;
         _cleanup_close_ int fd = -1;
         char buf[33];
         ssize_t k;
         unsigned j;
         sd_id128_t t;

         assert_return(ret, -EINVAL);

         if (saved_machine_id_valid) {
                 *ret = saved_machine_id;       <<< line 118
                 return 0;
         }

         fd = open("/etc/machine-id", O_RDONLY|O_CLOEXEC|O_NOCTTY);
         if (fd < 0)
                 return -errno;

         [...]

         saved_machine_id = t;
         saved_machine_id_valid = true;

         *ret = t;
         return 0;
 }

... how can this happen? The value of static saved_machine_id is valid after the first open of /etc/machine-id followed by the calcluation and it remains valid ... beside a corrupted memory!
Comment 3 Thomas Blume 2014-11-11 11:18:13 UTC
Yeah, seems to be corrupted memory.
The context with the file system might be high i/o load.
Maybe parts of systemd-journald were swapped out.
Martin, was there swap usage at the time when the crash happened?
Comment 4 Martin Pluskal 2014-11-11 12:02:08 UTC
Hmm, issue with full disk is/was related to bnc#904023 which was triggered by trinity running (http://codemonkey.org.uk/projects/trinity/ - running as nobody), there is other core.dump at http://w3.suse.de/~mpluskal/904837/) - if the issue is related to memory corruption, could it be kernel issue?
Comment 5 Dr. Werner Fink 2014-11-11 12:08:06 UTC
(In reply to Martin Pluskal from comment #4)

Indeed this could be a kernel issue ...
Comment 6 Vlastimil Babka 2014-11-11 14:12:17 UTC
(In reply to Dr. Werner Fink from comment #2)
> Hmmm ... after reading sd_id128_get_machine() from
> src/libsystemd/sd-id128/sd-id128.c this becomes strange ...
> 
>  _public_ int sd_id128_get_machine(sd_id128_t *ret) {
>          static thread_local sd_id128_t saved_machine_id;
>          static thread_local bool saved_machine_id_valid = false;
>          _cleanup_close_ int fd = -1;
>          char buf[33];
>          ssize_t k;
>          unsigned j;
>          sd_id128_t t;
> 
>          assert_return(ret, -EINVAL);
> 
>          if (saved_machine_id_valid) {
>                  *ret = saved_machine_id;       <<< line 118
>                  return 0;
>          }
> 
>          fd = open("/etc/machine-id", O_RDONLY|O_CLOEXEC|O_NOCTTY);
>          if (fd < 0)
>                  return -errno;
> 
>          [...]
> 
>          saved_machine_id = t;
>          saved_machine_id_valid = true;
> 
>          *ret = t;
>          return 0;
>  }
> 
> ... how can this happen? The value of static saved_machine_id is valid after
> the first open of /etc/machine-id followed by the calcluation and it remains
> valid ... beside a corrupted memory!

The problem is apparently not with saved_machine_id, but with 'ret' pointing to a wrong memory:

Program terminated with signal SIGBUS, Bus error.
#0  0x0000000000426d5b in sd_id128_get_machine (ret=0x7f2c8f0ec028) at src/libsystemd/sd-id128/sd-id128.c:118
(gdb) p saved_machine_id
$1 = {bytes = "\312|\366;W6E\275\257\000\060éN}}", qwords = {13638366794857479370, 9042470117860966575}}
(gdb) p ret
$2 = (sd_id128_t *) 0x7f2c8f0ec028
(gdb) p *ret
Cannot access memory at address 0x7f2c8f0ec028
(gdb) up
#1  0x00000000004157f7 in journal_file_refresh_header (f=0x9212e0) at src/journal/journal-file.c:205
  >│205             r = sd_id128_get_machine(&f->header->machine_id);
(gdb) up
#2  journal_file_open (fname=fname@entry=0x7fff634d72a0 "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal", flags=flags@entry=66, mode=mode@entry=416, compress=compress@entry=true, 
    seal=seal@entry=true, metrics=metrics@entry=0x7fff634d75d8, mmap_cache=mmap_cache@entry=0x920250, template=template@entry=0x0, ret=ret@entry=0x7fff634d7568) at src/journal/journal-file.c:2585

It appears it had mmapped file "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal"
The file does exist on the system:
Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (  494/systemd-journal)

SIGBUS occurs when somebody truncates the file between mmap and accessing pointer beyond the point of truncation. At this moment it seems that the file is not truncated so much to fail writing on the first page of it (machine_id is at offset 28), but it appears the file is still being modified, so it could have been truncated to zero at the time of the bug?
Comment 7 Dr. Werner Fink 2014-11-11 15:08:01 UTC
(In reply to Dr. Werner Fink from comment #5)

Does this mean that swapping may truncated the file?  Or can this also be caused by disk space shortage?  The question rises how to detect that f->header->machine_id from JournalFile pointer f is beyound the end of the mapped region.
Comment 8 Vlastimil Babka 2014-11-11 16:08:12 UTC
(In reply to Werner Fink from comment #7)
> (In reply to Dr. Werner Fink from comment #5)
> 
> Does this mean that swapping may truncated the file?  Or can this also be
> caused by disk space shortage?  The question rises how to detect that
> f->header->machine_id from JournalFile pointer f is beyound the end of the
> mapped region.

Swapping alone shouldn't cause this, if data was swapped out, it would be just swapped back in. Unless it was truncated in parallel. Also I would expect that disk shortage shouldn't truncate files, but I'm not a btrfs (or even fs) developer :/
Comment 9 Martin Pluskal 2014-11-18 07:33:28 UTC
I am wondering if it could be related to bnc#905832
Comment 10 Dr. Werner Fink 2014-11-18 07:48:31 UTC
(In reply to Martin Pluskal from comment #9)

Ouch ... I was not aware that we're talking about BtrFS, both bugs

 boo#905832: Please backport mmap btrfs fixes/or update to kernel 3.16.7/3.17.3
 boo#904023: WARNING: CPU: 2 PID: 31066 at ../fs/btrfs/extent-tree.c:3799 

rises the question: does this bug happen without using BtrFS?
Comment 11 Dr. Werner Fink 2014-11-18 13:26:02 UTC
Can you reproduce the SIGBUS on an e.g. xfs based file system for /var/log?
Comment 12 Martin Pluskal 2014-11-18 13:33:45 UTC
Already trying with xfs, so far not reproduced.
Comment 13 Martin Pluskal 2015-04-08 07:04:36 UTC
Haven't seen this issue in a while, feel free to close.
Comment 14 Thomas Blume 2015-04-09 09:00:16 UTC
(In reply to Martin Pluskal from comment #13)
> Haven't seen this issue in a while, feel free to close.

Thanks for the feedback, closing.