Bugzilla – Bug 904837
systemd-journald - Program terminated with signal SIGBUS, Bus error.
Last modified: 2015-04-09 09:00:16 UTC
Created attachment 613168 [details]
core

I noticed that systemd-journald crashed when there was not enough space on the root device:

Core was generated by `/usr/lib/systemd/systemd-journald'.
Program terminated with signal SIGBUS, Bus error.
#0  0x0000000000426d5b in sd_id128_get_machine (ret=0x7f2c8f0ec028) at src/libsystemd/sd-id128/sd-id128.c:118
118     src/libsystemd/sd-id128/sd-id128.c: No such file or directory.

(gdb) bt
#0  0x0000000000426d5b in sd_id128_get_machine (ret=0x7f2c8f0ec028) at src/libsystemd/sd-id128/sd-id128.c:118
#1  0x00000000004157f7 in journal_file_refresh_header (f=0x9212e0) at src/journal/journal-file.c:205
#2  journal_file_open (fname=fname@entry=0x7fff634d72a0 "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal", flags=flags@entry=66, mode=mode@entry=416, compress=compress@entry=true, seal=seal@entry=true, metrics=metrics@entry=0x7fff634d75d8, mmap_cache=mmap_cache@entry=0x920250, template=template@entry=0x0, ret=ret@entry=0x7fff634d7568) at src/journal/journal-file.c:2585
#3  0x00000000004163fc in journal_file_open_reliably (fname=fname@entry=0x7fff634d72a0 "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal", flags=flags@entry=66, mode=mode@entry=416, compress=true, seal=true, metrics=metrics@entry=0x7fff634d75d8, mmap_cache=0x920250, template=template@entry=0x0, ret=ret@entry=0x7fff634d7568) at src/journal/journal-file.c:2683
#4  0x000000000040ce68 in system_journal_open (s=s@entry=0x7fff634d74f0) at src/journal/journald-server.c:984
#5  0x000000000040dee4 in server_init (s=s@entry=0x7fff634d74f0) at src/journal/journald-server.c:1628
#6  0x0000000000406468 in main (argc=<optimized out>, argv=<optimized out>) at src/journal/journald.c:57

(gdb) disassemble $pc-32,$pc+32
Dump of assembler code from 0x426d3b to 0x426d7b:
   0x0000000000426d3b <sd_id128_get_machine+43>:  mov    %fs:0x0,%rax
   0x0000000000426d44 <sd_id128_get_machine+52>:  cmpb   $0x0,-0x28(%rax)
   0x0000000000426d4b <sd_id128_get_machine+59>:  je     0x426d90 <sd_id128_get_machine+128>
   0x0000000000426d4d <sd_id128_get_machine+61>:  mov    -0x30(%rax),%rdx
   0x0000000000426d54 <sd_id128_get_machine+68>:  mov    -0x38(%rax),%rax
=> 0x0000000000426d5b <sd_id128_get_machine+75>:  mov    %rdx,0x8(%rbp)
   0x0000000000426d5f <sd_id128_get_machine+79>:  mov    %rax,0x0(%rbp)
   0x0000000000426d63 <sd_id128_get_machine+83>:  xor    %eax,%eax
   0x0000000000426d65 <sd_id128_get_machine+85>:  mov    0x48(%rsp),%rcx
   0x0000000000426d6a <sd_id128_get_machine+90>:  xor    %fs:0x28,%rcx
   0x0000000000426d73 <sd_id128_get_machine+99>:  jne    0x426e97 <sd_id128_get_machine+391>
   0x0000000000426d79 <sd_id128_get_machine+105>: add    $0x50,%rsp
End of assembler dump.

(gdb) p $pc
$1 = (void (*)()) 0x426d5b <sd_id128_get_machine+75>
What does this mean? Please show us the output of df so we can see how you filled up your root device.
Hmmm ... after reading sd_id128_get_machine() from src/libsystemd/sd-id128/sd-id128.c this becomes strange ...

_public_ int sd_id128_get_machine(sd_id128_t *ret) {
        static thread_local sd_id128_t saved_machine_id;
        static thread_local bool saved_machine_id_valid = false;
        _cleanup_close_ int fd = -1;
        char buf[33];
        ssize_t k;
        unsigned j;
        sd_id128_t t;

        assert_return(ret, -EINVAL);

        if (saved_machine_id_valid) {
                *ret = saved_machine_id;    <<< line 118
                return 0;
        }

        fd = open("/etc/machine-id", O_RDONLY|O_CLOEXEC|O_NOCTTY);
        if (fd < 0)
                return -errno;

        [...]

        saved_machine_id = t;
        saved_machine_id_valid = true;

        *ret = t;
        return 0;
}

... how can this happen? The static saved_machine_id becomes valid after the first open of /etc/machine-id followed by the calculation, and it remains valid from then on ... unless memory is corrupted!
Yeah, this looks like corrupted memory. The connection to the full file system might be high I/O load: maybe parts of systemd-journald were swapped out. Martin, was swap in use at the time the crash happened?
Hmm, the full-disk issue is/was related to bnc#904023, which was triggered by running trinity (http://codemonkey.org.uk/projects/trinity/, running as nobody). There is another core dump at http://w3.suse.de/~mpluskal/904837/. If the issue is related to memory corruption, could it be a kernel issue?
(In reply to Martin Pluskal from comment #4)

Indeed this could be a kernel issue ...
(In reply to Dr. Werner Fink from comment #2)
> Hmmm ... after reading sd_id128_get_machine() from
> src/libsystemd/sd-id128/sd-id128.c this becomes strange ...
>
> _public_ int sd_id128_get_machine(sd_id128_t *ret) {
>         static thread_local sd_id128_t saved_machine_id;
>         static thread_local bool saved_machine_id_valid = false;
>         _cleanup_close_ int fd = -1;
>         char buf[33];
>         ssize_t k;
>         unsigned j;
>         sd_id128_t t;
>
>         assert_return(ret, -EINVAL);
>
>         if (saved_machine_id_valid) {
>                 *ret = saved_machine_id;    <<< line 118
>                 return 0;
>         }
>
>         fd = open("/etc/machine-id", O_RDONLY|O_CLOEXEC|O_NOCTTY);
>         if (fd < 0)
>                 return -errno;
>
>         [...]
>
>         saved_machine_id = t;
>         saved_machine_id_valid = true;
>
>         *ret = t;
>         return 0;
> }
>
> ... how can this happen? The static saved_machine_id becomes valid after
> the first open of /etc/machine-id followed by the calculation, and it
> remains valid from then on ... unless memory is corrupted!

The problem is apparently not with saved_machine_id, but with 'ret' pointing to inaccessible memory:

Program terminated with signal SIGBUS, Bus error.
#0  0x0000000000426d5b in sd_id128_get_machine (ret=0x7f2c8f0ec028) at src/libsystemd/sd-id128/sd-id128.c:118

(gdb) p saved_machine_id
$1 = {bytes = "\312|\366;W6E\275\257\000\060\303\251N}}", qwords = {13638366794857479370, 9042470117860966575}}
(gdb) p ret
$2 = (sd_id128_t *) 0x7f2c8f0ec028
(gdb) p *ret
Cannot access memory at address 0x7f2c8f0ec028
(gdb) up
#1  0x00000000004157f7 in journal_file_refresh_header (f=0x9212e0) at src/journal/journal-file.c:205
205             r = sd_id128_get_machine(&f->header->machine_id);
(gdb) up
#2  journal_file_open (fname=fname@entry=0x7fff634d72a0 "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal", flags=flags@entry=66, mode=mode@entry=416, compress=compress@entry=true, seal=seal@entry=true, metrics=metrics@entry=0x7fff634d75d8, mmap_cache=mmap_cache@entry=0x920250, template=template@entry=0x0, ret=ret@entry=0x7fff634d7568) at src/journal/journal-file.c:2585

It appears journald had mmapped the file "/var/log/journal/ca7cf63b573645bdaf0030c3a94e7d7d/system.journal". The file does exist on the system:

Access: (0640/-rw-r-----)  Uid: (    0/    root)   Gid: (  494/systemd-journal)

SIGBUS occurs when somebody truncates the file between the mmap() and an access through a pointer beyond the truncation point. Right now the file is not truncated far enough for a write to its first page to fail (machine_id is at offset 28), but the file is apparently still being modified, so could it have been truncated to zero at the time of the crash?
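To illustrate the failure mode, here is a minimal stand-alone sketch of my own (not taken from journald; the path /tmp/sigbus-demo is made up for the example). It maps a one-page file, truncates it behind the mapping's back, and touches the mapping again; the second store raises SIGBUS just like the crash above:

/* Minimal SIGBUS reproducer: truncating a file under an active mapping
 * makes later accesses to the now-unbacked pages fault with SIGBUS. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
        long pagesize = sysconf(_SC_PAGESIZE);
        int fd = open("/tmp/sigbus-demo", O_RDWR|O_CREAT|O_TRUNC, 0600);
        if (fd < 0)
                return perror("open"), 1;

        /* Give the file one page of backing store and map it shared. */
        if (ftruncate(fd, pagesize) < 0)
                return perror("ftruncate"), 1;
        char *p = mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return perror("mmap"), 1;

        p[0] = 'x';   /* OK: the page is backed by the file */

        /* Shrink the file to zero, as a racing writer (or cleanup on a
         * full disk) might do. The mapping itself stays in place. */
        if (ftruncate(fd, 0) < 0)
                return perror("ftruncate"), 1;

        p[0] = 'y';   /* SIGBUS: the page is beyond the new end of file */

        puts("not reached");
        return 0;
}

In the journald case f->header is exactly such a mapping of system.journal, and machine_id at offset 28 lives on its very first page, so even a truncation to zero would be enough to trigger this.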
(In reply to Dr. Werner Fink from comment #5)

Does this mean that swapping may have truncated the file? Or can this also be caused by a disk space shortage? The question arises how to detect that f->header->machine_id from the JournalFile pointer f is beyond the end of the mapped region.
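There is no race-free way to check the size up front, as an fstat() followed by the dereference can still lose against a concurrent truncation, so the usual way out is to make the access itself survivable by catching SIGBUS. A rough sketch of that general technique, assuming a single-threaded caller (read_mapped_byte is a hypothetical helper for illustration, not the actual journald code):

/* Sketch: probe a byte inside a file-backed mapping, returning -1
 * instead of crashing if the page is no longer backed by the file. */
#include <setjmp.h>
#include <signal.h>

static sigjmp_buf sigbus_env;

static void sigbus_handler(int sig) {
        (void) sig;
        siglongjmp(sigbus_env, 1);   /* jump back past the faulting access */
}

static int read_mapped_byte(const volatile char *p, char *out) {
        struct sigaction sa = { .sa_handler = sigbus_handler };
        struct sigaction old;
        volatile int r = -1;         /* volatile: survives siglongjmp() */

        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGBUS, &sa, &old) < 0)
                return -1;

        if (sigsetjmp(sigbus_env, 1) == 0) {
                *out = *p;           /* faults if the file was truncated */
                r = 0;
        }
        /* else: SIGBUS was delivered and we jumped back here, r stays -1 */

        sigaction(SIGBUS, &old, NULL);
        return r;
}

(If I recall correctly, upstream systemd later went in a similar direction and installed a SIGBUS handler for its journal mappings.)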
(In reply to Werner Fink from comment #7)
> (In reply to Dr. Werner Fink from comment #5)
>
> Does this mean that swapping may have truncated the file? Or can this also
> be caused by a disk space shortage? The question arises how to detect that
> f->header->machine_id from the JournalFile pointer f is beyond the end of
> the mapped region.

Swapping alone shouldn't cause this: if data was swapped out, it would simply be swapped back in, unless the file was truncated in parallel. I would also expect that a disk shortage shouldn't truncate files, but I'm not a btrfs (or even fs) developer :/
I am wondering if it could be related to bnc#905832
(In reply to Martin Pluskal from comment #9)

Ouch ... I was not aware that we're talking about btrfs. Both bugs,

boo#905832: Please backport mmap btrfs fixes/or update to kernel 3.16.7/3.17.3
boo#904023: WARNING: CPU: 2 PID: 31066 at ../fs/btrfs/extent-tree.c:3799

raise the question: does this bug happen without using btrfs?
Can you reproduce the SIGBUS with /var/log on a different file system, e.g. xfs?
Already trying with xfs, so far not reproduced.
Haven't seen this issue in a while, feel free to close.
(In reply to Martin Pluskal from comment #13)
> Haven't seen this issue in a while, feel free to close.

Thanks for the feedback, closing.