Bug 957816

Summary: system lags a lot while copying to slow devices
Product: [openSUSE] openSUSE Tumbleweed Reporter: Stanislav Brabec <sbrabec>
Component: Kernel Assignee: Jan Kara <jack>
Status: NEW --- QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: mgorman, mhocko
Version: Current   
Target Milestone: ---   
Hardware: x86-64   
OS: SUSE Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: dmesg SysRq-w dump while soft freeze occurs

Description Stanislav Brabec 2015-12-03 15:58:36 UTC
Created attachment 658269
dmesg SysRq-w dump while soft freeze occurs

While copying large files to slow devices (DVD-RAM, USB 2.0 flash drives, and USB 2.0 attached hard disks), the desktop experiences soft freezes.

Depending on the device and the files written, a freeze can last from a fraction of a second to many seconds. It affects either a particular application (e.g. all terminals from one factory if rsync runs in a terminal), all applications, or even mouse movement.

It was previously reported as bug 133718, but it still persists in some form.

The attached SysRq-w dump shows one such short freeze (XFCE Terminal frozen for several seconds).
Comment 1 Stanislav Brabec 2015-12-03 16:01:46 UTC
Note that this problem is strongly masked on FAT-based drivers by the default "flush" option. To reproduce it on FAT-formatted flash drives, you need to mount them without "flush".
Comment 2 Jan Kara 2015-12-14 11:27:24 UTC
So the reason for the stalls seems to be that page faults end up in direct reclaim waiting for IO. What is your /proc/sys/vm/dirty_ratio? How much memory does your machine have?

Could you sample /proc/meminfo, say every second, while the copy is running and attach the log here once a stall has happened? Mel, any other ideas on what to look at?
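One way to collect the requested samples is a small shell loop. This is only a sketch: the helper name sample_meminfo, its arguments, and the log file name are illustrative assumptions, not something specified in this report.

```shell
#!/bin/sh
# Sketch: append timestamped snapshots of /proc/meminfo to a log file.
# sample_meminfo is a hypothetical helper name. Arguments:
#   $1 = log file
#   $2 = number of samples to take
#   $3 = seconds between samples (default 1)
#   $4 = file to sample (default /proc/meminfo)
sample_meminfo() {
    log=$1
    count=$2
    interval=${3:-1}
    src=${4:-/proc/meminfo}
    i=0
    while [ "$i" -lt "$count" ]; do
        # One header line per sample so a stall can be matched to wall-clock time.
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$log"
        cat "$src" >> "$log"
        i=$((i + 1))
        # Do not sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (run in a second terminal while the copy is in progress):
# sample_meminfo meminfo.log 600 1
```

The Dirty and Writeback lines in the resulting log are probably the most interesting ones to compare against the time of a stall.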
Comment 3 Mel Gorman 2015-12-14 13:12:51 UTC
(In reply to Jan Kara from comment #2)
> So the reason for the stalls seems to be that page faults end up in direct
> reclaim waiting for IO. What is your /proc/sys/vm/dirty_ratio? How much
> memory does your machine have?
> 
> Could you sample /proc/meminfo say every second while the copy is running
> and attach it here when the stall happens? Mel any other idea what to look
> at?

Altering dirty_ratio is certainly one option although it's possible it'll simply defer the problem.

wait_iff_congested is only meant to trigger when dirty or under-writeback pages reach the end of the LRU multiple times and the device is congested. If it is the case that most or all dirty files are backed by this device, then the stall will trigger.

There is some anecdotal evidence upstream that this problem is worse on recent kernels than it used to be. Unfortunately I do not recall the specifics, but Michal was working in that area, so it should be fresh in his mind. Michal?

An extreme workaround would be to use cgcreate and cgset to create a memory-limited cgroup and run cp within that cgroup with cgexec. That would prevent too much memory being dirtied by the slow storage but it's not desirable as a general solution.
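The cgroup workaround could look roughly like the sketch below, assuming the libcgroup tools and the cgroup v1 memory controller are available. The function name limited_copy, the group name "slowcopy", and the 512M default limit are illustrative assumptions, not recommended values.

```shell
#!/bin/sh
# Sketch of the cgroup workaround with the libcgroup tools (cgroup v1 memory controller).
# limited_copy, the group name "slowcopy" and the 512M default are assumptions.
# The commands need root. Set RUN=echo to print them instead of executing them.
limited_copy() {
    group=slowcopy
    limit=${LIMIT:-512M}
    # Create a cgroup under the memory controller.
    ${RUN:-} cgcreate -g "memory:$group"
    # Cap the memory, page cache included, that is charged to the group.
    ${RUN:-} cgset -r "memory.limit_in_bytes=$limit" "$group"
    # Run cp inside the group so the pages it dirties count against the limit.
    ${RUN:-} cgexec -g "memory:$group" cp "$@"
    # Remove the group once the copy has finished.
    ${RUN:-} cgdelete -g "memory:$group"
}

# Example (as root): limited_copy /path/to/large.file /path/to/slow/mountpoint/
```

Running it first with RUN=echo shows the exact commands without touching the system.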
Comment 4 Stanislav Brabec 2015-12-14 14:32:27 UTC
cat /proc/sys/vm/dirty_ratio
20

This is the default Tumbleweed kernel 4.3.0-2-default with the default setup.

The machine has 8 GB of RAM and 4 GB of swap.
This is the state when no copy is running:
free
              total        used        free      shared  buff/cache   available
Mem:        8100000     3444452      176580      185048     4478968     4318020
Swap:       4184768      994068     3190700
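
For a rough sense of scale, the numbers above can be turned into an estimate of how much dirty page cache dirty_ratio=20 permits. This is only a sketch: the kernel applies the ratio to its own count of dirtyable memory (free plus reclaimable pages), so using the "available" column is an approximation, and the 10240 kB/s write rate is a purely hypothetical figure, not a measurement from this report.

```shell
#!/bin/sh
# Rough estimate only. Helper names are hypothetical.
# dirty_limit_kb RATIO AVAILABLE_KB: approximate kB of dirty page cache allowed.
dirty_limit_kb() {
    echo $(( $2 * $1 / 100 ))
}

# drain_seconds DIRTY_KB RATE_KB_PER_S: whole seconds needed to write that much out.
drain_seconds() {
    echo $(( $1 / $2 ))
}

# dirty_ratio and "available" memory as posted above.
limit=$(dirty_limit_kb 20 4318020)
echo "approximate dirty limit: $limit kB"

# ASSUMPTION: 10240 kB/s sustained write rate, chosen only for illustration.
echo "time to drain at 10240 kB/s: $(drain_seconds "$limit" 10240) s"
```

With these inputs the limit comes out at roughly 843 MiB, which a slow device would need on the order of a minute or more to write back.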


The flash drive is 128 GB and is attached via USB 2.0.

I'll prepare a reproducer and post the meminfo log.