Bug 91189

Summary: data corruption on DVD-RAM
Product: [openSUSE] openSUSE 10.2 Reporter: Stanislav Brabec <sbrabec>
Component: KernelAssignee: Tony Jones <tonyj>
Status: RESOLVED WORKSFORME QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: jack, jeffm, nadvornik
Version: Final   
Target Milestone: ---   
Hardware: i586   
OS: All   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: blktrace.out.bz2

Description Stanislav Brabec 2005-06-16 12:12:55 UTC
I have a DVD-RW/RAM LG GSA-4120B connected to USB2 external case.

eject /dev/sr0
mkudffs --utf8 --media-type=dvdram --vid=corrupted /dev/sr0
eject /dev/sr0
eject -t /dev/sr0
sleep 30
mkdir /media/corrupted/test
for i in $(seq 64) ; do touch /media/corrupted/test/$i ; done
ls /media/corrupted/test
eject /dev/sr0
eject -t /dev/sr0
sleep 30
ls -R /media/corrupted

... There is no directory test in second ls.

Note that man eject says:
If the device is currently mounted, it is unmounted before ejecting.

And syslog says:
VFS: busy inodes on changed media.

I suspect, that this bug can occur even without eject command, but I am unable
to repeat it.

Maybe related: bug 74402
Comment 1 Jeff Mahoney 2005-06-20 14:56:54 UTC
I don't believe this bug is related to #74402. That bug is an issue with vfat
not honoring -osync and getting corruption during an eject-before-umount case.

I suspect this bug has to do with inode/dentry refcounting in udffs. Since this
is a CDROM device, it won't eject unless it can umount the filesystem first.
Comment 2 Jan Kara 2005-06-28 15:31:43 UTC
I've been playing with the drive here. The main result is that without 'subfs'
everything seems to be working fine and no data is lost.

I'll try to find out more later.
Comment 3 Stanislav Brabec 2005-06-29 09:00:09 UTC
Can you confirm your second result, that some drives report disk change, even if
medium removal is prevented and medium unload fails? If it is true, it probably
means, that kernel should track or check door lock status before sending eject.
Comment 4 Jan Kara 2005-06-29 09:24:33 UTC
Yes, on my notebook (HP Omnibook 6000) media-changed for the CD-ROM drive is set
even though the previous CDROMEJECT was unsuccessful and returned EIO. OTOH I
have not been able to reproduce the behaviour on any other computer...
Comment 5 Stanislav Brabec 2005-07-26 11:03:02 UTC
It seems, that I am able to repeat this bug even without subfs by calling eject
without previous sync. Vanilla kernel 2.6.12.1.

Symptoms are the same (used inodes on mounted medium in syslog, corrupted
directory or file).

But I am not yet able to simply reproduce this corruption.
Comment 6 Marian Jancar 2005-07-26 14:59:50 UTC
When no eject method specified, eject tries them all in this order until one
succeeds (or all fai):

ioctl(fd, CDROMEJECT);
series of ioctl(fd, SCSI_IOCTL_SEND_COMMAND, (void *)&scsi_cmd);
ioctl(fd, FDEJECT);
ioctl(fd, MTIOCTOP, &op);

I would think the burner dislikes especialy the SCSI commands. Could you please
try to reproduce with each eject mehod?
Comment 7 Stanislav Brabec 2005-07-26 15:14:29 UTC
I guess that before trying to send eject ioctl, eject calls umount and umount
should write all buffers to medium.

As I wrote, I am not able to reproduce this problem in a simple way without subfs.
Comment 8 Jan Kara 2005-07-27 08:08:24 UTC
Yes, the problem with the eject command is, that it call umount(1) and that does
sync before unmounting. I'll extend my ejecting program to not only call
CDROMEJECT (which is the right command and works) but also call umount(2) so we
avoid syncing.

And just for record - I reproduced the problem with ext2+subfs so it's probably
not a UDF problem.
Comment 9 Jan Kara 2005-07-28 16:12:29 UTC
So the problem arise when ioctl CDROMEJECT is called before umount. The ioctl 
returns EIO but the medium is ejected anyway and hence kernel starts to spit 
errors and the rest is obvious. Jens, what is the expected behaviour? I think 
that returning EIO but ejecting the CD is wrong but that seems more like the 
drive's fault... Do you have some suggestion how to fix it? 
Comment 10 Stanislav Brabec 2005-07-29 10:23:44 UTC
I guess you want NEEDINFO from Jens Axboe.
Comment 11 Jan Kara 2005-08-11 11:19:38 UTC
Further investigation shows that the command send to DVD-RAM returns error
0x8000002 (and hence ioctl() subsequently returns EIO).
Comment 12 Michael Gross 2005-10-04 12:21:32 UTC
Hello!

Due to the proposal from Andreas Jaeger, this bug will be closed as a reminder,
because it is in a NEEDINFO-state for longer than 4 weeks by now which indicates
a very low activity for this bug. If the required information is offered by the
respective person or the bug has in fact a wrong status, it should be reopened,
changing/correcting the information and a reason should be provided.

In any case there should be checked if this problem hasn't been already fixed or
is at least of relevance for the current release, specially for the older bugs
(<9.3).

Sorry if this causes you any inconvenience.

Kind regards,
the BNC-Screening-Team
Comment 13 Stanislav Brabec 2005-10-04 17:02:20 UTC
Not yet fixed, reopening.
Comment 14 Jens Axboe 2006-06-06 13:24:52 UTC
Should be ok for 10.1.
Comment 15 Stanislav Brabec 2006-06-18 01:04:54 UTC
Reopening. Experienced again in SuSE Linux 10.1. This time reproduced on i586 machine.

On cleanly formatted DVD-RAM (mkudffs --vid=BackupDVD1 --media-type=dvdram --utf8 /dev/hdc), ejected and inserted only as fresh formatted to automount it, then wrote 4GB, I again got:

Jun 18 02:41:22 utx kernel: udf: udf_read_inode(ino 558554) failed !bh
Jun 18 02:41:22 utx kernel: udf: udf_get_filelongad() invalidparms
Jun 18 02:41:37 utx kernel: udf: udf_read_inode(ino 558554) failed !bh
Jun 18 02:41:37 utx kernel: udf: udf_get_filelongad() invalidparms
Jun 18 02:41:37 utx kernel: udf: udf_read_inode(ino 558554) failed !bh

In 4th level of subdirectory I found corrupted directory. As root it returns:
utx:/media/BackupDVD1/sb/SuSE_Atari_Amiga/Atari_130XE # ls
/bin/ls: REST: Permission denied
BASIC  CHR  COM  ENT  FRACTAL  INDEX  MARS  PHOTO  ROMS  SRC  TOS  TOSMAN  TXT

And reading another corrupted medium with unfinished write, I got:
Jun 17 15:51:40 utx kernel: Unable to handle kernel paging request at virtual address bc60a4c0
Jun 17 15:51:40 utx kernel: f12aefd0
Jun 17 15:51:40 utx kernel: *pde = 00000000
Jun 17 15:51:40 utx kernel: Oops: 0000 [#1]
Jun 17 15:51:40 utx kernel: CPU:    0
Jun 17 15:51:40 utx kernel: EIP:    0060:[<f12aefd0>]    Tainted: G     U VLI
Jun 17 15:51:40 utx kernel: EFLAGS: 00010217   (2.6.16.13-4-default #1)
Jun 17 15:51:40 utx kernel: eax: 00000028   ebx: bc60a4c0   ecx: c4adbd5c   edx: 00000030
Jun 17 15:51:40 utx kernel: esi: a6508878   edi: e8b2f214   ebp: c4adbd44   esp: c4adbccc
Jun 17 15:51:40 utx kernel: ds: 007b   es: 007b   ss: 0068
Jun 17 15:51:40 utx kernel: Stack: <0>bc60a4c0 e8b2f1dc f12a58fc 00000001 c4adbd4c e8b2f214 c4adbd54 c4adbd44
Jun 17 15:51:40 utx kernel:        f12a59f5 c4adbd44 c4adbd58 c4adbd54 00000001 c4adbd5c e8b2f2ff e8b2f33c
Jun 17 15:51:40 utx kernel:        e8b2f1dc df4ca564 f12ae6c0 c4adbd44 c4adbd58 c4adbd54 00000001 e8b2f214
Jun 17 15:51:40 utx kernel: Call Trace:
Jun 17 15:51:40 utx kernel:  [<f12a58fc>] udf_current_aext+0xab/0x12d [udf]
Jun 17 15:51:40 utx kernel:  [<f12a59f5>] udf_next_aext+0x77/0x87 [udf]
Jun 17 15:51:40 utx kernel:  [<f12ae6c0>] udf_discard_prealloc+0x153/0x266 [udf]
Jun 17 15:51:40 utx kernel:  [<f12a8d7d>] udf_clear_inode+0x16/0x2a [udf]
Jun 17 15:51:40 utx kernel:  [<c015e6c8>] clear_inode+0xd5/0x102
Jun 17 15:51:40 utx kernel:  [<c015e8eb>] generic_drop_inode+0x101/0x114
Jun 17 15:51:40 utx kernel:  [<c015e26b>] iput+0x63/0x65
Jun 17 15:51:40 utx kernel:  [<f12a6d9c>] udf_iget+0x85/0x8c [udf]
Jun 17 15:51:40 utx kernel:  [<f12aa7cc>] udf_lookup+0x62/0x85 [udf]
Jun 17 15:51:40 utx kernel:  [<c01554a4>] do_lookup+0xa3/0x135
Jun 17 15:51:40 utx kernel:  [<c0156e6f>] __link_path_walk+0x6c1/0xab6
Jun 17 15:51:40 utx kernel:  [<c0164a2e>] setxattr+0xb0/0xbf
Jun 17 15:51:40 utx kernel:  [<c01572ab>] link_path_walk+0x47/0xb9
Jun 17 15:51:40 utx kernel:  [<c0157600>] do_path_lookup+0x1cc/0x1e6
Jun 17 15:51:40 utx kernel:  [<c015d3d4>] dput+0x31/0x122
Jun 17 15:51:40 utx kernel:  [<c01607b9>] mntput_no_expire+0x11/0x59
Jun 17 15:51:40 utx kernel:  [<c0164b08>] sys_setxattr+0x43/0x4f
Jun 17 15:51:40 utx kernel:  [<c01575cc>] do_path_lookup+0x198/0x1e6
Jun 17 15:51:40 utx kernel:  [<c0157d0e>] __user_walk_fd+0x29/0x3a
Jun 17 15:51:40 utx kernel:  [<c0152326>] vfs_lstat_fd+0x12/0x39
Jun 17 15:51:40 utx kernel:  [<c0157600>] do_path_lookup+0x1cc/0x1e6
Jun 17 15:51:40 utx kernel:  [<c015d3d4>] dput+0x31/0x122
Jun 17 15:51:40 utx kernel:  [<c01607b9>] mntput_no_expire+0x11/0x59
Jun 17 15:51:40 utx kernel:  [<c0164b08>] sys_setxattr+0x43/0x4f
Jun 17 15:51:40 utx kernel:  [<c0152392>] sys_lstat64+0xf/0x23
Jun 17 15:51:40 utx kernel:  [<c010580a>] do_syscall_trace+0x123/0x169
Jun 17 15:51:40 utx kernel:  [<c0102a39>] syscall_call+0x7/0xb
Jun 17 15:51:40 utx kernel: Code: c3 56 85 c0 53 89 d6 89 c3 74 04 85 c9 75 0f 68 40 0d 2b f1 e8 ba 8c e6 ce 31 c0 5e eb 21 8b 01 85 c0 78 19 8d 50 08 39 f2 77 12 <83> 3b 00 89 d8 74 0b 83 7c 24 0c 00 74 06 89 11 eb 02 31 c0 5b


>>EIP; f12aefd0 <pg0+30f1ffd0/3fc6f400>   <=====

>>ebx; bc60a4c0 <phys_startup_32+bc50a4c0/c0000000>
>>ecx; c4adbd5c <pg0+474cd5c/3fc6f400>
>>esi; a6508878 <phys_startup_32+a6408878/c0000000>
>>edi; e8b2f214 <pg0+287a0214/3fc6f400>
>>ebp; c4adbd44 <pg0+474cd44/3fc6f400>
>>esp; c4adbccc <pg0+474cccc/3fc6f400>

Trace; f12a58fc <pg0+30f168fc/3fc6f400>
Trace; f12a59f5 <pg0+30f169f5/3fc6f400>
Trace; f12ae6c0 <pg0+30f1f6c0/3fc6f400>
Trace; f12a8d7d <pg0+30f19d7d/3fc6f400>
Trace; c015e6c8 <clear_inode+d5/102>
Trace; c015e8eb <generic_drop_inode+101/114>
Trace; c015e26b <iput+63/65>
Trace; f12a6d9c <pg0+30f17d9c/3fc6f400>
Trace; f12aa7cc <pg0+30f1b7cc/3fc6f400>
Trace; c01554a4 <do_lookup+a3/135>
Trace; c0156e6f <__link_path_walk+6c1/ab6>
Trace; c0164a2e <setxattr+b0/bf>
Trace; c01572ab <link_path_walk+47/b9>
Trace; c0157600 <do_path_lookup+1cc/1e6>
Trace; c015d3d4 <dput+31/122>
Trace; c01607b9 <mntput_no_expire+11/59>
Trace; c0164b08 <sys_setxattr+43/4f>
Trace; c01575cc <do_path_lookup+198/1e6>
Trace; c0157d0e <__user_walk_fd+29/3a>
Trace; c0152326 <vfs_lstat_fd+12/39>
Trace; c0157600 <do_path_lookup+1cc/1e6>
Trace; c015d3d4 <dput+31/122>
Trace; c01607b9 <mntput_no_expire+11/59>
Trace; c0164b08 <sys_setxattr+43/4f>
Trace; c0152392 <sys_lstat64+f/23>
Trace; c010580a <do_syscall_trace+123/169>
Trace; c0102a39 <syscall_call+7/b>

This architecture has variable length instructions, decoding before eip
is unreliable, take these instructions with a pinch of salt.

Code;  f12aefa5 <pg0+30f1ffa5/3fc6f400>
00000000 <_EIP>:
Code;  f12aefa5 <pg0+30f1ffa5/3fc6f400>
   0:   c3                        ret
Code;  f12aefa6 <pg0+30f1ffa6/3fc6f400>
   1:   56                        push   %esi
Code;  f12aefa7 <pg0+30f1ffa7/3fc6f400>
   2:   85 c0                     test   %eax,%eax
Code;  f12aefa9 <pg0+30f1ffa9/3fc6f400>
   4:   53                        push   %ebx
Code;  f12aefaa <pg0+30f1ffaa/3fc6f400>
   5:   89 d6                     mov    %edx,%esi
Code;  f12aefac <pg0+30f1ffac/3fc6f400>
   7:   89 c3                     mov    %eax,%ebx
Code;  f12aefae <pg0+30f1ffae/3fc6f400>
   9:   74 04                     je     f <_EIP+0xf>
Code;  f12aefb0 <pg0+30f1ffb0/3fc6f400>
   b:   85 c9                     test   %ecx,%ecx
Code;  f12aefb2 <pg0+30f1ffb2/3fc6f400>
   d:   75 0f                     jne    1e <_EIP+0x1e>
Code;  f12aefb4 <pg0+30f1ffb4/3fc6f400>
   f:   68 40 0d 2b f1            push   $0xf12b0d40
Code;  f12aefb9 <pg0+30f1ffb9/3fc6f400>
  14:   e8 ba 8c e6 ce            call   cee68cd3 <_EIP+0xcee68cd3>
Code;  f12aefbe <pg0+30f1ffbe/3fc6f400>
  19:   31 c0                     xor    %eax,%eax
Code;  f12aefc0 <pg0+30f1ffc0/3fc6f400>
  1b:   5e                        pop    %esi
Code;  f12aefc1 <pg0+30f1ffc1/3fc6f400>
  1c:   eb 21                     jmp    3f <_EIP+0x3f>
Code;  f12aefc3 <pg0+30f1ffc3/3fc6f400>
  1e:   8b 01                     mov    (%ecx),%eax
Code;  f12aefc5 <pg0+30f1ffc5/3fc6f400>
  20:   85 c0                     test   %eax,%eax
Code;  f12aefc7 <pg0+30f1ffc7/3fc6f400>
  22:   78 19                     js     3d <_EIP+0x3d>
Code;  f12aefc9 <pg0+30f1ffc9/3fc6f400>
  24:   8d 50 08                  lea    0x8(%eax),%edx
Code;  f12aefcc <pg0+30f1ffcc/3fc6f400>
  27:   39 f2                     cmp    %esi,%edx
Code;  f12aefce <pg0+30f1ffce/3fc6f400>
  29:   77 12                     ja     3d <_EIP+0x3d>

This decode from eip onwards should be reliable

Code;  f12aefd0 <pg0+30f1ffd0/3fc6f400>
00000000 <_EIP>:
Code;  f12aefd0 <pg0+30f1ffd0/3fc6f400>   <=====
   0:   83 3b 00                  cmpl   $0x0,(%ebx)   <=====
Code;  f12aefd3 <pg0+30f1ffd3/3fc6f400>
   3:   89 d8                     mov    %ebx,%eax
Code;  f12aefd5 <pg0+30f1ffd5/3fc6f400>
   5:   74 0b                     je     12 <_EIP+0x12>
Code;  f12aefd7 <pg0+30f1ffd7/3fc6f400>
   7:   83 7c 24 0c 00            cmpl   $0x0,0xc(%esp)
Code;  f12aefdc <pg0+30f1ffdc/3fc6f400>
   c:   74 06                     je     14 <_EIP+0x14>
Code;  f12aefde <pg0+30f1ffde/3fc6f400>
   e:   89 11                     mov    %edx,(%ecx)
Code;  f12aefe0 <pg0+30f1ffe0/3fc6f400>
  10:   eb 02                     jmp    14 <_EIP+0x14>
Code;  f12aefe2 <pg0+30f1ffe2/3fc6f400>
  12:   31 c0                     xor    %eax,%eax
Code;  f12aefe4 <pg0+30f1ffe4/3fc6f400>
  14:   5b                        pop    %ebx
Comment 16 Jeff Mahoney 2006-06-20 20:57:33 UTC
Stanislav - Since 10.1 no longer uses subfs, I'd like to start at the beginning again. Are the steps to reproduce still the same?
Comment 17 Stanislav Brabec 2006-06-21 10:49:41 UTC
The lost buffer is not reproducible and not reproducible by an easy way in the top of this report. I wrote the same data second time and the error did not occur. For SuSE 10.0 this problem appeared during nearly every writing and the DVD-RAM with UDF was unusable. In 10.1, I have experienced this problem only once during writing of about 10 x 4G.

But the filesystem corruption is the same: The directory should contain another directory. Instead of it, it contains undeletable object of unknown type.

I am using rsync for creating DVD-RAM back-ups.

This time the corruption appeared without ejecting the media (it was ejected only when it was empty). Syslog did not show any IDE or I/O error in time of writing. Corruption appeared in the middle of writing.

Possible blind guesses:
- race while renaming file and doing mkdir in the same directory at once
- random buffer overflow for long or international file names
- fuzzy buffers (trashed before write or written outdated)

The Oops is reproducible, if you have properly corrupted filesystem.
Comment 18 Stanislav Brabec 2006-07-03 21:30:08 UTC
Making another back-up of the same directory, the error occurred nearly in the same place, now the one level upper is not writable ("permission denied" error) nor deletable ("directory not empty" error).

Now the error occurs one inode later.
Jul  3 02:34:00 utx kernel: udf: udf_read_inode(ino 558555) failed !bh

I moved two corrupted inodes to a separate directory and tried to delete them:
rm -rf /media/BackupDVD1/BAD
rm: cannot remove directory `/media/BackupDVD1/BAD/tool-options': Directory not empty
rm: cannot lstat `/media/BackupDVD1/BAD/SuSE_Atari_Amiga/Atari_130XE': Permission denied
Comment 19 Stanislav Brabec 2006-07-04 19:10:06 UTC
Making a "ls -alRi --sort=none /media/BackupDVD1", corrupted item was the last directory in the list of each "session". "sync" command was called before ejecting the disc.
Comment 20 Stanislav Brabec 2006-07-27 09:58:52 UTC
It seems, that the corruption is caused by a "sync" command in some circumstances. After removing it from the backup script, I have not experienced the corruption.
Comment 21 Jens Axboe 2006-09-12 13:06:44 UTC
Reopen for 10.1 if the bug still exists.
Comment 22 Stanislav Brabec 2006-09-13 12:22:15 UTC
Yes, it still exists in 10.1 and comments #15 and below are related to 10.1.

Note to the previous comments: The corruption occurs much more frequent, if "sync" command is called before ejecting the disc, but it was already experienced several times even without calling the "sync".
Comment 23 Stanislav Brabec 2006-09-29 13:50:42 UTC
Corruption experienced once again. This time it was after "disc full" error using rsync.
Comment 24 Greg Kroah-Hartman 2006-12-19 23:16:10 UTC
Does this still happen on 10.2?  If so, please reopen with the new log messages.
Comment 25 Stanislav Brabec 2007-03-05 11:34:22 UTC
Still happens in openSUSE 10.2:

hammer:~ # ls -al "/media/BackupDVD1/sb_deleted/.wine/"
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/dosdevices: Permission denied
total 430
drwxr-xr-x  4 root users    240 Oct  4 11:04 .
drwxr-xr-x 10 root users   4028 Feb 27 10:52 ..
??????????  ? ?    ?          ?            ? dosdevices
drwxr-xr-x  4 root users    204 Dec 28 21:20 drive_c
-rw-------  1 root users 366067 Jan 10 23:51 system.reg
-rw-------  1 root users  52825 Jan 11 01:08 user.reg
hammer:~ # ls -al "/media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files"
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/Winwap Technologies: Permission denied
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/Neat Image: Permission denied
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/InstallShield Installation Information: Permission denied
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/FunTom: Permission denied
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/FreeUndelete: Permission denied
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/DiMAGE Viewer: Permission denied
ls: cannot access /media/BackupDVD1/sb_deleted/.wine/drive_c/Program Files/ASUS: Permission denied
total 0
drwxr-xr-x 10 root users 480 Dec 28 21:20 .
drwxr-xr-x  4 root users 204 Dec 28 21:20 ..
??????????  ? ?    ?       ?            ? ASUS
drwxr-xr-x  3 root users  92 Dec 28 21:20 Common Files
??????????  ? ?    ?       ?            ? DiMAGE Viewer
??????????  ? ?    ?       ?            ? FreeUndelete
??????????  ? ?    ?       ?            ? FunTom
??????????  ? ?    ?       ?            ? InstallShield Installation Information
??????????  ? ?    ?       ?            ? Neat Image
??????????  ? ?    ?       ?            ? Winwap Technologies

Notes:
- Medium was never inserted to any OS except openSUSE 10.2.
- Backup process was interrupted, after reboot I continued with backing-up.
- My backup script uses rsync --delete -b --backup-dir=/media/BackupDVD1/sb_deleted -v -aHS
- I did "mount -oremount,noatime /dev/sr0" during work.
Comment 26 Stanislav Brabec 2007-03-05 11:38:34 UTC
Syslog shows:
Mar  5 12:08:18 hammer kernel: UDF-fs INFO UDF 0.9.8.1 (2004/29/09) Mounting volume 'BackupDVD1', timestamp 2007/02/26 07:14 (103c)
Mar  5 12:16:45 hammer kernel: udf: udf_read_inode(ino 1935807) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935983) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935811) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935844) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935848) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935880) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935919) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935920) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935954) failed !bh
Mar  5 12:16:46 hammer kernel: udf: udf_read_inode(ino 1935956) failed !bh
Mar  5 12:16:47 hammer kernel: udf: udf_read_inode(ino 1936221) failed !bh
Mar  5 12:18:19 hammer kernel: udf: udf_read_inode(ino 1935807) failed !bh
Mar  5 12:18:39 hammer kernel: udf: udf_read_inode(ino 1935807) failed !bh
Mar  5 12:18:48 hammer kernel: udf: udf_read_inode(ino 1935807) failed !bh
Mar  5 12:28:59 hammer kernel: udf: udf_read_inode(ino 1935807) failed !bh

No other directories except upper mentioned were corrupted during backup of full 4.6GB and several thousand directories.
Comment 29 Stanislav Brabec 2007-04-27 15:43:18 UTC
This bug is hard to reproduce, but easily to experience.

If I want to experience it, I am simply using my home-backup script.

If you use it daily, often stop by Ctrl+C if and then continue while still working on the desktop, you will experience this corruption in a few days.

rsync --delete -b --backup-dir=/media/test_dvdram/home_deleted -v -aHS . /media/test/dvdram/home
Comment 30 Tony Jones 2007-04-27 15:58:35 UTC
I'm going to try taking a look at it next week.  Thanks for the testcase, this is pretty much what I'd distilled from the previous comments also.
Comment 31 Stanislav Brabec 2007-04-30 10:53:22 UTC
And an additional info: Both DVD+-RW/RAM units where I experienced this bug are manufactured by LG.

There is known problem of some LG firmwares, that in some special conditions they may trash buffers, e. g. problems reported with cdrecord sg0/sr0 locking clash problem (status check commands issued to the drive break running burning process).
Comment 32 Tony Jones 2008-04-24 08:10:14 UTC
Stanislav, have you tried using a different form of drive?  I note in comment #31 that there is possible concern over bad LG firmware.   I don't have this drive and have been unable to reproduce using the hardware I have,
Comment 33 Stanislav Brabec 2008-04-24 09:23:41 UTC
No. All drives I am using, are LG. I can provide you access to one.

VladimĂ­r may know more about the LG drives problem in cdrecord.
Comment 34 Vladimir Nadvornik 2008-04-24 09:29:50 UTC
It was bug #226019
Comment 35 Tony Jones 2008-04-24 09:38:21 UTC
> I can provide you access to one.
I'm located in the US, might be problematic but I'll let you decide this.
Comment 36 Tony Jones 2008-05-08 13:48:05 UTC
Stanislav, since I'm unable to reproduce this on my hardware, should I reassign to someone geographically in same area as you?   Don't see you sending the drive to the United States as being particularly practical.
Comment 37 Stanislav Brabec 2008-05-08 15:21:05 UTC
I just encountered very similar breakage on USB flash while working around bug 388248 on openSUSE 10.3 i586.

How to reproduce:

1. Insert USB flash

2. Mount manually (I used "mount -tvfat -onoexec,nosuid,async,nodev,gid=100,umask=0002,utf8=true,quiet,rw /dev/sda1 /mnt/", but it seems to be reproducible with other arguments as well.)

3. mkdir /mnt/testdir

4. sync

5. umount /mnt

6. sync

7. Wait for sync finished. Wait few additional seconds. Remove the USB flash

Now insert again and you see corrupted data.

Note that only manual umount is affected. If I am using GNOME volume manager to unmount the medium, I never seen the corruption.

I don't know GNOME volume manager exactly. Is there any other action, which is not performed when doing steps above?
Comment 38 Jan Kara 2008-05-08 16:31:46 UTC
I'm reluctant to believe this ;) Can you show me on Monday at work? I'd suspect some HW or userspace issue because I haven't seen any other reports of similar problems...
Comment 39 Stanislav Brabec 2008-05-12 19:35:35 UTC
I was not able to reproduce the problem with an arbitrary flash, but when I restored the original image back, the I got the corruption again.

Steps to reproduce: see comment 37

ls -al /mnt/
ls: cannot access /mnt/testdir: Input/output error
total 8
drwxrwxr-x  3 root users 4096 Jan  1  1970 .
drwxr-xr-x 25 root root  4096 May 12 20:58 ..
d?????????  ? ?    ?        ?            ? testdir

Note that writing large amount of data to this flash works. Only last few blocks are lost.

Syslog contents:
May 12 21:26:38 utx kernel: usb 4-6: new high speed USB device using ehci_hcd and address 11
May 12 21:26:38 utx kernel: usb 4-6: new device found, idVendor=1307, idProduct=0163
May 12 21:26:38 utx kernel: usb 4-6: new device strings: Mfr=1, Product=2, SerialNumber=3
May 12 21:26:38 utx kernel: usb 4-6: Product: USB Flash Drive
May 12 21:26:38 utx kernel: usb 4-6: Manufacturer: USB 2.0
May 12 21:26:38 utx kernel: usb 4-6: SerialNumber: 789485f5f19059
May 12 21:26:38 utx kernel: usb 4-6: configuration #1 chosen from 1 choice
May 12 21:26:38 utx kernel: scsi7 : SCSI emulation for USB Mass Storage devices
May 12 21:26:38 utx kernel: usb-storage: device found at 11
May 12 21:26:38 utx kernel: usb-storage: waiting for device to settle before scanning
May 12 21:26:39 utx kernel: scsi 7:0:0:0: Direct-Access     USB 2.0  USB Flash Drive  0.00 PQ: 0 ANSI: 2
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] 7897087 512-byte hardware sectors (4043 MB)
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] Write Protect is off
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] Mode Sense: 00 00 00 00
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] Assuming drive cache: write through
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] 7897087 512-byte hardware sectors (4043 MB)
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] Write Protect is off
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] Mode Sense: 00 00 00 00
May 12 21:26:39 utx kernel: sd 7:0:0:0: [sda] Assuming drive cache: write through
May 12 21:26:40 utx kernel:  sda: sda1
May 12 21:26:40 utx kernel: sd 7:0:0:0: [sda] Attached SCSI removable disk
May 12 21:26:40 utx kernel: sd 7:0:0:0: Attached scsi generic sg0 type 0
May 12 21:26:40 utx kernel: usb-storage: device scan complete
May 12 21:28:01 utx kernel: FAT: Did not find valid FSINFO signature.
May 12 21:28:01 utx kernel:      Found signature1 0x00000000 signature2 0x00000000 (sector = 1)
May 12 21:28:03 utx kernel: FAT: Did not find valid FSINFO signature.
May 12 21:28:03 utx kernel:      Found signature1 0x00000000 signature2 0x00000000 (sector = 1)
May 12 21:28:11 utx kernel: FAT: Did not find valid FSINFO signature.
May 12 21:28:11 utx kernel:      Found signature1 0x00000000 signature2 0x00000000 (sector = 1)
May 12 21:28:28 utx kernel: usb 4-6: USB disconnect, address 11
May 12 21:28:37 utx kernel: usb 4-6: new high speed USB device using ehci_hcd and address 12
May 12 21:28:37 utx kernel: usb 4-6: new device found, idVendor=1307, idProduct=0163
May 12 21:28:37 utx kernel: usb 4-6: new device strings: Mfr=1, Product=2, SerialNumber=3
May 12 21:28:37 utx kernel: usb 4-6: Product: USB Flash Drive
May 12 21:28:37 utx kernel: usb 4-6: Manufacturer: USB 2.0
May 12 21:28:37 utx kernel: usb 4-6: SerialNumber: 789485f5f19059
May 12 21:28:37 utx kernel: usb 4-6: configuration #1 chosen from 1 choice
May 12 21:28:37 utx kernel: scsi8 : SCSI emulation for USB Mass Storage devices
May 12 21:28:37 utx kernel: usb-storage: device found at 12
May 12 21:28:37 utx kernel: usb-storage: waiting for device to settle before scanning
May 12 21:28:38 utx kernel: scsi 8:0:0:0: Direct-Access     USB 2.0  USB Flash Drive  0.00 PQ: 0 ANSI: 2
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] 7897087 512-byte hardware sectors (4043 MB)
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Write Protect is off
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Mode Sense: 00 00 00 00
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Assuming drive cache: write through
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] 7897087 512-byte hardware sectors (4043 MB)
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Write Protect is off
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Mode Sense: 00 00 00 00
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Assuming drive cache: write through
May 12 21:28:38 utx kernel:  sdb: sdb1
May 12 21:28:38 utx kernel: sd 8:0:0:0: [sdb] Attached SCSI removable disk
May 12 21:28:38 utx kernel: sd 8:0:0:0: Attached scsi generic sg0 type 0
May 12 21:28:38 utx kernel: usb-storage: device scan complete
May 12 21:28:52 utx kernel: FAT: Did not find valid FSINFO signature.
May 12 21:28:52 utx kernel:      Found signature1 0x00000000 signature2 0x00000000 (sector = 1)
May 12 21:28:55 utx smartd[3593]: Device: /dev/hda, SMART Usage Attribute: 190 Temperature_Celsius changed from 67 to 56
May 12 21:28:55 utx smartd[3593]: Device: /dev/hda, SMART Usage Attribute: 194 Temperature_Celsius changed from 33 to 44
May 12 21:28:55 utx smartd[3593]: Device: /dev/hda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 70 to 66
May 12 21:28:56 utx kernel: FAT: Filesystem panic (dev sdb1)
May 12 21:28:56 utx kernel:     fat_get_cluster: invalid cluster chain (i_pos 0)
May 12 21:28:56 utx kernel:     File system has been set read-only
May 12 21:31:40 utx su: (to nobody) root on none
May 12 21:31:43 utx syslog-ng[2072]: last message repeated 2 times
Comment 40 Stanislav Brabec 2008-05-12 19:37:24 UTC
Created attachment 214511 [details]
blktrace.out.bz2

blktrace output (binary form).
Comment 41 Stanislav Brabec 2008-06-22 21:43:01 UTC
Just seen another VFAT corruptions on openSUSE 11.0 (i586 and x86_64). Reproduced on another vendor formatted medium (A-Data 2G SD+USB card in USB mode). Medium was accepted by HAL and mounted, files were copied. When reading them, tails of copied archives were corrupted (truncated).

Device was umounted via GNOME desktop device icon context menu in both cases.

Syslog shows no errors.
Comment 42 Stanislav Brabec 2008-06-26 10:00:04 UTC
I did a research on this corruption:

- copy file to the problematic flash
- compare files - the result is OK
- umount ; sync (or use GNOME desktop unmount)
- dump the whole contents of the flash
- eject and re-insert
- mount (or wait for mounting)
- dump the whole contents of the flash
- compare files - somewhere in the last few megabytes, there is a corrupted block
- compare dumps - there is 2048 different bytes

So I am starting to believe, that it is not a problem of the filesystem.
Comment 43 Stanislav Brabec 2008-06-30 12:04:37 UTC
The flash problem was added to the kernel database as http://bugzilla.kernel.org/show_bug.cgi?id=11012

I did not reproduce the DVD-RAM problem since update to openSUSE 11.0.
Comment 44 Tony Jones 2008-10-31 21:08:24 UTC
Based on comment #43 "I did not reproduce the DVD-RAM problem since update to openSUSE 11.0" I'm proposing closing this bug.  You are free to reopen if it recoccurs.
Comment 45 Stanislav Brabec 2008-11-03 13:17:38 UTC
I agree.