Bug 1216858 - python-eventlet tests fail with kernel 6.6
Summary: python-eventlet tests fail with kernel 6.6
Status: RESOLVED UPSTREAM
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: Other Other
Priority: P5 - None    Severity: Normal
Target Milestone: ---
Assignee: Jiri Slaby
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-03 09:44 UTC by Jiri Slaby
Modified: 2023-11-27 14:00 UTC
CC: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
jslaby: needinfo? (p.drouand)
jslaby: needinfo? (dmueller)


Attachments
BAD strace -rT -e trace=network,/epoll (4.38 KB, text/plain)
2023-11-03 09:44 UTC, Jiri Slaby
Details
GOOD strace -rT -e trace=network,/epoll (2.86 KB, text/plain)
2023-11-03 09:45 UTC, Jiri Slaby
Details

Description Jiri Slaby 2023-11-03 09:44:50 UTC
Created attachment 870621 [details]
BAD strace -rT -e trace=network,/epoll

Currently at:
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:H/python-eventlet/standard/x86_64

> =================================== FAILURES ===================================
> _______________________ TestGreenSocket.test_full_duplex _______________________
>
> self = <tests.greenio_test.TestGreenSocket testMethod=test_full_duplex>
>
>     def test_full_duplex(self):
> ...
> >       large_evt.wait()
>
> tests/greenio_test.py:424:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> eventlet/greenthread.py:181: in wait
>     return self._exit_event.wait()
> eventlet/event.py:125: in wait
>     result = hub.switch()
...
> E       tests.TestIsTakingTooLong: 1
>
> eventlet/hubs/hub.py:313: TestIsTakingTooLong

to this commit. With the offending commit (bisected below), the test takes > 1.5 s. Without the commit it takes only < 300 ms. And the test sets its timeout to 1 s.

The reduced self-standing test case:
#!/usr/bin/python3
import eventlet
from eventlet.green import select, socket, time, ssl

def bufsized(sock, size=1):
    """ Resize both send and receive buffers on a socket.
    Useful for testing trampoline.  Returns the socket.

    >>> import socket
    >>> sock = bufsized(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock

def min_buf_size():
    """Return the minimum buffer size that the platform supports."""
    test_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    test_sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1)
    return test_sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

def test_full_duplex():
    large_data = b'*' * 10 * min_buf_size()
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(('127.0.0.1', 0))
    listener.listen(50)
    bufsized(listener)

    def send_large(sock):
        sock.sendall(large_data)

    def read_large(sock):
        result = sock.recv(len(large_data))
        while len(result) < len(large_data):
            result += sock.recv(len(large_data))
        assert result == large_data

    def server():
        (sock, addr) = listener.accept()
        sock = bufsized(sock)
        send_large_coro = eventlet.spawn(send_large, sock)
        eventlet.sleep(0)
        result = sock.recv(10)
        expected = b'hello world'
        while len(result) < len(expected):
            result += sock.recv(10)
        assert result == expected
        send_large_coro.wait()

    server_evt = eventlet.spawn(server)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(('127.0.0.1', listener.getsockname()[1]))
    bufsized(client)
    large_evt = eventlet.spawn(read_large, client)
    eventlet.sleep(0)
    client.sendall(b'hello world')
    server_evt.wait()
    large_evt.wait()
    client.close()

test_full_duplex()

=====================================

I bisected it to:
commit dfa2f0483360d4d6f2324405464c9f281156bd87
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jul 17 15:29:17 2023 +0000

    tcp: get rid of sysctl_tcp_adv_win_scale
Comment 1 Jiri Slaby 2023-11-03 09:45:16 UTC
Created attachment 870622 [details]
GOOD strace -rT -e trace=network,/epoll
Comment 2 Jiri Slaby 2023-11-03 10:01:33 UTC
Upstream report:
https://lore.kernel.org/all/738bb6a1-4e99-4113-9345-48eea11e2108@kernel.org/

Eric says:
Retracting TCP windows has always been problematic.

If we really want to be very gentle, this could add more logic,
shorter timer events for pathological cases like that,
I am not sure this is really worth it, especially if dealing with one
million TCP sockets in this state.
Comment 3 Michal Kubeček 2023-11-03 10:06:09 UTC
How about my suggestion to set SO_SNDBUF and SO_RCVBUF before listen() and
connect()? Would that be an acceptable solution or does the test depend on
setting the buffer sizes this late?
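
A minimal sketch of that suggestion, using plain stdlib sockets for illustration (function names are mine, not from the eventlet test suite): shrink SO_SNDBUF/SO_RCVBUF before listen() and connect(), so the sizes are in effect when the connection is set up and accepted sockets inherit them from the listener.

```python
import socket

def make_small_buf_listener(size=1):
    """Listener with tiny buffers requested *before* listen()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Set buffer sizes before listen(); sockets returned by accept()
    # inherit these values from the listening socket.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    listener.bind(('127.0.0.1', 0))
    listener.listen(50)
    return listener

def make_small_buf_client(port, size=1):
    """Client with tiny buffers requested *before* connect()."""
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set buffer sizes before connect(), so they already apply when
    # the TCP connection (and its window) is negotiated.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    client.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    client.connect(('127.0.0.1', port))
    return client
```

Whether this avoids the slowdown on 6.6 would still need testing; the point is only that the original test resizes the buffers after listen()/connect(), which is the behavior Michal is asking about.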
Comment 4 Jiri Slaby 2023-11-03 10:08:56 UTC
Upstream, this seems to be going the way of "the test is broken":
It seems the test had some expectations.

Setting a small (1 byte) RCVBUF/SNDBUF, and yet expecting to send
46080 bytes fast enough was not reasonable.
It might have relied on the fact that tcp sendmsg() can cook large GSO
packets, even if sk->sk_sndbuf is small.

With tight memory settings, it is possible TCP has to resort on RTO
timers (200ms by default) to recover from dropped packets.

(In reply to Michal Kubeček from comment #3)
> How about my suggestion to set SO_SNDBUF and SO_RCVBUF before listen() and
> connect()? Would that be an acceptable solution or does the test depend on
> setting the buffer sizes this late?

And this ^^
Comment 5 Michal Kubeček 2023-11-03 10:38:37 UTC
For the record, actual value of SO_SNDBUF is 4608, as seen in the strace
output. I would assume SO_RCVBUF will also be something like that.
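
For reference, this is easy to check directly: Linux silently clamps SO_SNDBUF/SO_RCVBUF requests to internal minimums (and doubles the requested value to account for bookkeeping overhead), so asking for 1 byte never actually yields 1 byte. A quick sketch (the exact granted values are kernel-dependent; the 4608 above is from this reporter's system):

```python
import socket

def effective_buf_sizes(requested=1):
    """Request tiny socket buffers and return what the kernel grants.

    The kernel clamps the request to its minimum buffer sizes, so the
    returned values are far larger than `requested`.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
    snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    s.close()
    return snd, rcv
```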
Comment 6 Dirk Mueller 2023-11-20 13:24:39 UTC
How can I help with the needinfo to me? Not clear to me. I think we should have this conversation with eventlet upstream?

I can disable the test if it is turning out to be problematic
Comment 7 Jiri Slaby 2023-11-27 14:00:21 UTC
I think there is not much we (I picked you likely because of your maintainership of the pkg) can do about it:
https://github.com/eventlet/eventlet/issues/821