Server Fault: Asked by Thoro on December 13, 2021
I'm archiving WAL files from a PostgreSQL server via rsync. Most of the time the archiving works fine and is quick. Here is the speed test for the connection (which goes over the internet):
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 9.30 MBytes 78.0 Mbits/sec 0 395 KBytes
[ 4] 1.00-2.00 sec 66.3 MBytes 556 Mbits/sec 14 1.05 MBytes
[ 4] 2.00-3.00 sec 75.0 MBytes 629 Mbits/sec 0 1.16 MBytes
[ 4] 3.00-4.00 sec 81.2 MBytes 682 Mbits/sec 0 1.24 MBytes
[ 4] 4.00-5.00 sec 86.2 MBytes 724 Mbits/sec 0 1.30 MBytes
[ 4] 5.00-6.00 sec 88.8 MBytes 744 Mbits/sec 0 1.34 MBytes
[ 4] 6.00-7.00 sec 91.2 MBytes 765 Mbits/sec 0 1.37 MBytes
[ 4] 7.00-8.00 sec 92.5 MBytes 776 Mbits/sec 0 1.38 MBytes
[ 4] 8.00-9.00 sec 93.8 MBytes 786 Mbits/sec 0 1.39 MBytes
[ 4] 9.00-10.00 sec 63.8 MBytes 535 Mbits/sec 22 535 KBytes
So, more than enough bandwidth available.
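For reference, output in that shape comes from an iperf3 client run against the remote end, along the lines of the following (the host name here is only a stand-in):

iperf3 -c barman_host -t 10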
But some WAL files just slow to a crawl, taking 30 to 50 seconds until the 16 MB file is transferred, and I'm at a loss as to where to debug / look for the issue.
The rsync command looks like this:
rsync -p --chmod=Fg+r,Fo+r --timeout 10 \
    -e '/usr/bin/ssh -i /var/lib/pgsql/.ssh/id_rsa' \
    -a pg_wal/000000080000A5500000005D \
    barman@barman_host:/data/database/pg/incoming/000000080000A5500000005D
I looked at rsync via strace on the receiving side, and it seems the packets from the sending side were simply not arriving fast enough.
I tried cat-ing the file over ssh and printing it to my console; that finished before the rsync transfer did.
I tried cat-ing it to /dev/null; that was instant.
So I assume the source drive is fast enough.
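Roughly, those two checks look like this (a sketch reusing the file and key paths from the rsync command above, not the exact invocations):

# Stream the WAL file over ssh and discard it remotely: if this is fast,
# neither the network path nor the source read speed is the bottleneck.
cat pg_wal/000000080000A578000000E8 | \
    ssh -i /var/lib/pgsql/.ssh/id_rsa barman@barman_host 'cat > /dev/null'

# Local read only: near-instant means the source drive is fine.
cat pg_wal/000000080000A578000000E8 > /dev/null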
I transferred a batch of WAL files (60 GB) via a single rsync command; that was also quick and averaged 65 MB/s. That would suggest everything works fine, yet some individual files are still slow.
What else can I look at?
How can I figure out whether the issue is on the sending side, in the internet path, or on the receiving side? Are there any special logs I can activate for rsync?
Can I check the timings of the syscalls via strace?
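For the rsync logging question, repeating -v and adding --stats/--progress already exposes a lot, and strace can attach to a running transfer with per-syscall timings. A sketch (the PID is a placeholder):

# More verbose rsync run with transfer statistics:
rsync -avv --stats --progress \
    -e '/usr/bin/ssh -i /var/lib/pgsql/.ssh/id_rsa' \
    pg_wal/000000080000A5500000005D \
    barman@barman_host:/data/database/pg/incoming/

# Attach to a running rsync: -tt prints wall-clock timestamps, -T the
# time spent inside each syscall, -f follows child processes.
strace -f -tt -T -p <rsync_pid> -o /tmp/rsync-trace.log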
bash-4.2$ ls -l 000000080000A578000000E8
-rw-------. 1 postgres postgres 16777216 Jul 19 07:32 000000080000A578000000E8
bash-4.2$ du -sh 000000080000A578000000E8
11M 000000080000A578000000E8
bash-4.2$ du -sh 000000080000A578000000E8 --apparent-size
16M 000000080000A578000000E8
The WAL drive is ZFS with compression active, hence the size difference.
Also, for completeness, all ZFS properties:
NAME PROPERTY VALUE SOURCE
storage/database type filesystem -
storage/database creation Thu Apr 19 12:22 2018 -
storage/database used 1.33T -
storage/database available 369G -
storage/database referenced 1.33T -
storage/database compressratio 2.13x -
storage/database mounted yes -
storage/database quota none default
storage/database reservation none default
storage/database recordsize 16K inherited from storage
storage/database mountpoint /data/ local
storage/database sharenfs off default
storage/database checksum on default
storage/database compression lz4 inherited from storage
storage/database atime off inherited from storage
storage/database devices on default
storage/database exec on default
storage/database setuid on default
storage/database readonly off default
storage/database zoned off default
storage/database snapdir hidden default
storage/database aclinherit restricted default
storage/database createtxg 1159021 -
storage/database canmount on default
storage/database xattr sa inherited from storage
storage/database copies 1 default
storage/database version 5 -
storage/database utf8only off -
storage/database normalization none -
storage/database casesensitivity sensitive -
storage/database vscan off default
storage/database nbmand off default
storage/database sharesmb off default
storage/database refquota none default
storage/database refreservation none default
storage/database guid 8214081110063784152 -
storage/database primarycache all default
storage/database secondarycache all default
storage/database usedbysnapshots 0B -
storage/database usedbydataset 1.33T -
storage/database usedbychildren 0B -
storage/database usedbyrefreservation 0B -
storage/database logbias throughput inherited from storage
storage/database dedup off default
storage/database mlslabel none default
storage/database sync disabled local
storage/database dnodesize legacy default
storage/database refcompressratio 2.13x -
storage/database written 1.33T -
storage/database logicalused 2.82T -
storage/database logicalreferenced 2.82T -
storage/database volmode default default
storage/database filesystem_limit none default
storage/database snapshot_limit none default
storage/database filesystem_count none default
storage/database snapshot_count none default
storage/database snapdev hidden default
storage/database acltype off default
storage/database context none default
storage/database fscontext none default
storage/database defcontext none default
storage/database rootcontext none default
storage/database relatime off default
storage/database redundant_metadata all default
storage/database overlay off default
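A property listing like the one above comes from:

zfs get all storage/database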
But nothing was changed on the ZFS drive in the past few days, and this whole issue only started on Friday (July 17th).
Also, if I copy-paste the command and run it again while the first one is still hanging, the second copy finishes instantly; the still-running original continues to hang.
With ls -lah on the receiving side you can follow the temporary file getting bigger and bigger (at around 150 KB/s).
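To follow it continuously, something like this works on the receiving side (the incoming path is taken from the rsync destination above):

watch -n 1 'ls -lah /data/database/pg/incoming/'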
Thanks to anyone taking the time to read this!
Edit: I added timing recording to the WAL archive process (a sketch of such a wrapper follows the list); here's the result:
000000080000A57C00000034 1
000000080000A57C00000035 0
000000080000A57C00000036 0
000000080000A57C00000037 1
000000080000A57C00000038 1
000000080000A57C00000039 119
000000080000A57C0000003A 2
000000080000A57C0000003B 1
000000080000A57C0000003C 127
000000080000A57C0000003D 2
000000080000A57C0000003E 1
000000080000A57C0000003F 1
000000080000A57C00000040 1
000000080000A57C00000041 1
000000080000A57C00000042 1
000000080000A57C00000043 1
000000080000A57C00000044 1
000000080000A57C00000045 1
000000080000A57C00000046 1
000000080000A57C00000047 1
000000080000A57C00000048 1
000000080000A57C00000049 105
000000080000A57C0000004A 2
000000080000A57C0000004B 2
000000080000A57C0000004C 1
000000080000A57C0000004D 1
000000080000A57C0000004E 118
000000080000A57C0000004F 2
000000080000A57C00000050 1
000000080000A57C00000051 120
000000080000A57C00000052 2
000000080000A57C00000053 1
The number on the right is the time in seconds the rsync command took for the given file.
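A wrapper along these lines produces output in that shape; this is only a minimal sketch (script name, log path, and error handling are assumptions, not the script actually used):

#!/bin/bash
# Hypothetical archive_command wrapper, e.g.: archive_wal.sh %p %f
WAL_PATH="$1"    # %p: segment path relative to the data directory
WAL_NAME="$2"    # %f: file name only
START=$(date +%s)
rsync -p --chmod=Fg+r,Fo+r --timeout 10 \
    -e '/usr/bin/ssh -i /var/lib/pgsql/.ssh/id_rsa' \
    -a "$WAL_PATH" \
    "barman@barman_host:/data/database/pg/incoming/$WAL_NAME" || exit 1
END=$(date +%s)
echo "$WAL_NAME $((END - START))" >> /var/log/wal-archive-timing.log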
Edit 2:
I have reproduced the issue with two RAM drives, one on each side. I extracted the ports used and found they are all even (which might be a hint).
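One way to extract those ports on the sender (a sketch, assuming ssh runs on port 22 and iproute2's ss is available):

ss -tnp state established '( dport = :22 )'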
I switched between internet connections on my side (the target) and the problem went away. Based on the discussion, it seems to be a network issue over a certain path (maybe due to load balancing).
I will update with the final resolution.
Edit 3:
Our provider is Hetzner; they had a fault in one of their DECIX modules (https://www.hetzner-status.de/#16045). After the faulty module was deactivated, the issue went away.
Since the fault was network related, as initially suspected, I would recommend that anyone seeing similar symptoms follow the same debugging steps.
It makes little sense to keep this question open since it was an intermittent issue, but maybe someone in the future will run into the same thing.
Answered by Thoro on December 13, 2021