Bug 680113 - bad mtime handling of atjobs dir causes delayed execution of jobs
bad mtime handling of atjobs dir causes delayed execution of jobs
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE 11.4
Classification: openSUSE
Component: Basesystem
Factory
All All
: P5 - None : Normal (vote)
: ---
Assigned To: Vítězslav Čížek
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-03-16 11:25 UTC by Ingo Schwarze
Modified: 2011-04-28 11:56 UTC (History)
2 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
fix: rescan the atjobs dir whenever its mtime changed (636 bytes, patch)
2011-03-16 11:25 UTC, Ingo Schwarze
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ingo Schwarze 2011-03-16 11:25:07 UTC
Created attachment 419646 [details]
fix: rescan the atjobs dir whenever its mtime changed

User-Agent:       Mozilla/5.0 (X11; U; OpenBSD i386; en-US; rv:1.9.2.13) Gecko/20110216 Firefox/3.6.13

When a running /usr/sbin/atd process did a cycle through run_loop()
while the mtime of the atjobs directory happened to be set to a
future time, this atd process will no more execute any at jobs until
the system time catches up to the time that happened to be set as
the mtime of the atjobs directory.  Both existing and newly created
at jobs are affected and get delayed.

Reproducible: Always

Steps to Reproduce:
1. Schedule an at job to be run in ten minutes.
2. Set the mtime of the atjobs directory to one hour in the future, for example using the -t option of touch(1).
2. Restart the atd(8).
3. Schedule an at job to be run immediately.
Actual Results:  
Execution of both jobs will be delayed for at least one hour.
Neither touching the atjobs directory nor restarting the atd once again is going to help.
To work around the problem, you have to *both* touch the atjobs directory *and* restart the atd after touching it.

Expected Results:  
Changes of the mtime of the atjobs directory should always cause the atd to rescan the directory, no matter in which direction the mtime changed.

Submitted upstream to Debian on May 14, 2009:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528671

Patch was completely ignored by Debian so far.
Unless I missed something, I got absolutely zero feedback.
Comment 1 Ingo Schwarze 2011-03-16 11:27:45 UTC
By the way, we are using this patch in production in the stable version of the Astaro Security Gateway (www.astaro.com) for more than a year, which means it is deployed on tens of thousands of machines around the world.
Comment 2 Vítězslav Čížek 2011-04-15 15:47:45 UTC
(In reply to comment #0)

I can't reproduce this on my 11.4 box.

> Steps to Reproduce:
> 1. Schedule an at job to be run in ten minutes.
$ echo "command" | at +5 min
$ atq
3	2011-04-15 17:31 a cizek

> 2. Set the mtime of the atjobs directory to one hour in the future, for example
> using the -t option of touch(1).
$ sudo touch -t 04151755 /var/spool/atjobs
$ ls -ld /var/spool/atjobs                
drwx------. 2 at at 4096 Apr 15  2011 /var/spool/atjobs/

> 2. Restart the atd(8).
$ sudo rcatd restart
$ atq
3	2011-04-15 17:31 a cizek

> 3. Schedule an at job to be run immediately.
$ echo "command" | at now

The command gets executed immediately.

> Actual Results:  
> Execution of both jobs will be delayed for at least one hour.

Actually all the jobs get executed right on time.

Did I miss something, when I was trying to reproduce?
Comment 3 Ingo Schwarze 2011-04-18 16:12:58 UTC
(In reply to comment #2)

Oh indeed, at-3.1.8-massive_batch.patch, which didn't exist last time
i looked in Debian, hides the bug - as opposed to fixing it.

The variable nothing_to_do intends to avoid iteration of the spool
directory each time CHECK_INTERVAL expires when it is known that last
time, there were no waiting jobs, and in the meantime, the directory
did not change - which does not work because the directory access time
going backwards is misinterpreted as "no change".

Now, with at-3.1.8-massive_batch.patch, each time at(1) adds a job to the
spool directory, nothing_to_do is cleared by the signal, such that as long
as at(1) is used, and signals don't get lost, checking the access time of
the spool directory is rendered irrelevant by
the at-3.1.8-massive_batch.patch.

However, with a bit of manipulation, it is still possible to
reproduce the bug:

 - schedule a job in 5 minutes
 - move the spool file to /tmp
   # required such that the spool dir becomes empty
 - set the spool dir time to the future, e.g. by one day
 - restart the atd
   # it finds the spool dir empty, sets the flag nothing_to_do,
   # and goes to sleep for CHECK_INTERVAL
 - move the spool file back to the spool dir
   # that sets the spool dir change time back to the correct time
 - wait for the scheduled job time to come and send a SIGCONT signal -
   NOT a SIGHUP signal, of course! - to the atd (or equivalently, just 
   wait for the CHECK_INTERVAL to expire)
   # during the upcoming cycle, the atd sees that the spool dir time 
   # is LESS than the one it remembers, hence does *not* iterate the 
   # spool dir and goes right back to sleep, even though a job is due
 

Here is a protocol of that test:

ischwarze04:/root # rpm -q at
at-3.1.8-1069.15.53
ischwarze04:/root # date
Mon Apr 18 17:52:23 CEST 2011
ischwarze04:/root # echo "command" | at +5min
warning: commands will be executed using /bin/sh
job 10 at 2011-04-18 17:57
ischwarze04:/root # atq
10      2011-04-18 17:57 a root
ischwarze04:/root # mv /var/spool/atjobs/a0000a014b67fd /tmp/
ischwarze04:/root # touch -t 04191800 /var/spool/atjobs
ischwarze04:/root # ls -ld /var/spool/atjobs
drwx------ 2 at at 4096 2011-04-19 18:00 /var/spool/atjobs
ischwarze04:/root # /etc/init.d/atd restart
Shutting down service at daemon                       done
Starting service at daemon                            done
ischwarze04:/root # ls -ld /var/spool/atjobs
drwx------ 2 at at 4096 2011-04-19 18:00 /var/spool/atjobs
ischwarze04:/root # mv /tmp/a0000a014b67fd /var/spool/atjobs/
ischwarze04:/root # ls -ld /var/spool/atjobs
drwx------ 2 at at 4096 2011-04-18 17:54 /var/spool/atjobs
ischwarze04:/root # atq
10      2011-04-18 17:57 a root
ischwarze04:/root # date
Mon Apr 18 17:54:40 CEST 2011
ischwarze04:/root # date
Mon Apr 18 17:58:21 CEST 2011
ischwarze04:/root # atq
10      2011-04-18 17:57 a root
ischwarze04:/root # kill -CONT 23684
ischwarze04:/root # atq
10      2011-04-18 17:57 a root

Here is the associated strace snippet, showing that a run_loop is
triggered by the SIGCONT, but the atd does not even try to look
into the spool directory:

ischwarze04:/root # strace -p 23684
Process 23684 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>
) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGCONT (Continued) @ 0 (0) ---
restart_syscall(<... resuming interrupted call ...>

The bug is now hidden so well that is no longer of much
practical relevance, but it is still a bug.

Changing the comparison operator from "<=" to "==" as suggested
in my old patch makes sure that the atd correctly identifies a
directory whose atime goes backwards as a changing directory and
does not wrongly consider it as unchanged.

Thanks for checking,
Ingo
Comment 4 Vítězslav Čížek 2011-04-19 07:55:34 UTC
Thanks, Ingo, for your deep analysis of the bug behaviour.

The conditions to trigger this are now pretty ridiculous.
But I see no reason not to apply your patch.
Comment 5 Vítězslav Čížek 2011-04-21 07:00:19 UTC
Submitted the fix to factory.
Comment 6 Bernhard Wiedemann 2011-04-28 11:56:40 UTC
This is an autogenerated message for OBS integration:
This bug (680113) was mentioned in
https://build.opensuse.org/request/show/67877