Hi everyone,
hopefully the collective expertise of the forum can help me settle this issue, because I don't know what else to try.
I have a virtual Netcup root server RS 4000 G9.5 in Nuremberg and some hosted storage space in Helsinki from a different provider, and I use Borg Backup to back up my data from the virtual root server. My problem: the backup is slow as hell, and I am not able to find the root cause. I don't know whether the problem is I/O (disk) bound, CPU bound or network bound. No matter which monitoring statistics I look at, none of them shows that a limit is being hit, yet the backup is still slow.
Command to create Borg archive:
borg create \
--compression none \
--exclude-if-present '.nobackup' \
--keep-exclude-tags \
--exclude-from '/etc/borgbackup/exclude.conf' \
ssh://u377394@backup.mhnnet.de:23/~/borg-backups::'{hostname}-{utcnow:%Y-%m-%d_%H:%M:%S}' \
/etc \
/home \
/usr/local \
/var/backup/data \
/var/lib/portage/world \
/var/spool/mail/ \
/var/www
Note: I use no compression on purpose, as most of the data (in /var/www) is already highly compressed (e.g. JPEGs) anyway. (It is mostly a Nextcloud installation.)
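For reference, on the next run I plan to add Borg's own reporting options so that Borg itself tells me how fast it is going and which files it actually transfers (just a sketch, assuming Borg 1.2.x, where all of these options exist):

borg create \
    --stats \
    --progress \
    --list --filter=AME \
    --compression none \
    <same excludes, repository and paths as above>

--stats prints a summary with sizes and duration at the end, --progress shows live progress while the archive is being created, and --list --filter=AME prints only the files that were added, modified or had errors.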
While the backup is running:
- htop shows 4.6% CPU utilization on one core for python-3.11/borg
- iotop sporadically shows 3.5 MB/s disk utilization on /dev/vda3 for python-3.11/borg
- Wireshark shows 9.8 MBit/s network utilization on enp0s3
None of these numbers is anywhere near the theoretical maximum.
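One thing I have not done yet is check where the borg process itself spends its time, i.e. whether it sits in I/O wait or mostly waits for its ssh child. A sketch of what I have in mind (<borg-pid> is a placeholder; pidstat is part of the sysstat package):

pgrep -a borg                    # find the PIDs of borg and its ssh child
pidstat -d -u -p <borg-pid> 1    # per-second CPU and disk statistics for that PID
strace -c -f -p <borg-pid>       # syscall time summary, stop with Ctrl-C after ~30s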
To find out the maximum disk I/O, I ran
root@server ~ $ fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test1 --filename=test1 --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.34
Starting 1 process
test1: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=132MiB/s,w=43.1MiB/s][r=33.8k,w=11.0k IOPS][eta 00m:00s]
test1: (groupid=0, jobs=1): err= 0: pid=12416: Sat Dec 2 14:17:57 2023
read: IOPS=38.6k, BW=151MiB/s (158MB/s)(3070MiB/20384msec)
bw ( KiB/s): min=40160, max=332584, per=100.00%, avg=155611.50, stdev=81780.78, samples=40
iops : min=10040, max=83146, avg=38902.93, stdev=20445.29, samples=40
write: IOPS=12.9k, BW=50.3MiB/s (52.8MB/s)(1026MiB/20384msec); 0 zone resets
bw ( KiB/s): min=13416, max=109768, per=100.00%, avg=52001.53, stdev=27407.36, samples=40
iops : min= 3354, max=27442, avg=13000.38, stdev=6851.83, samples=40
cpu : usr=9.30%, sys=32.49%, ctx=134727, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=151MiB/s (158MB/s), 151MiB/s-151MiB/s (158MB/s-158MB/s), io=3070MiB (3219MB), run=20384-20384msec
WRITE: bw=50.3MiB/s (52.8MB/s), 50.3MiB/s-50.3MiB/s (52.8MB/s-52.8MB/s), io=1026MiB (1076MB), run=20384-20384msec
Disk stats (read/write):
vda: ios=784345/262272, merge=0/91, ticks=1006159/184131, in_queue=1190487, util=99.08%
This tells me that the theoretical read performance of the disk is around 151 MiB/s. Yes, I know that traversing many small files in a directory tree is slower than reading a single huge file in large chunks, but 3.5 MB/s is very far below 151 MiB/s.
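The fio job above is a 4k random read/write mix, while borg mostly reads files sequentially, so a sequential read job is probably closer to what borg actually does. A sketch of such a test (the parameters are just my guess at something representative):

fio --name=seqread --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --rw=read --bs=1M --iodepth=1 --size=4G --filename=test2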
To find out the maximum network throughput between my server and the remote storage, I created a file of random data, copied it over with scp and analyzed the result with Wireshark.
On one console I did
admin@server ~ $ tcpdump -i enp0s3 -s 86 -w /dev/shm/scp-to-backup.pcap "ip6 host 2a01:4f9:3b:5682::2 and tcp port 23"
tcpdump: listening on enp0s3, link-type EN10MB (Ethernet), snapshot length 86 bytes
442415 packets captured
442415 packets received by filter
0 packets dropped by kernel
Notes: I only dumped the first 86 bytes of each packet, which is the size of the combined Ethernet, IPv6 and TCP headers, to avoid capturing the entire payload. The IPv6 address 2a01:4f9:3b:5682::2 belongs to backup.mhnnet.de.
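(If it is useful to anybody, the per-second rates can also be extracted from that pcap on the command line instead of in the Wireshark GUI; a sketch, using tools that ship with Wireshark:)

capinfos /dev/shm/scp-to-backup.pcap                    # capture duration and packet count
tshark -r /dev/shm/scp-to-backup.pcap -q -z io,stat,1   # frame and byte counts per second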
On another console I did
backup@server ~ $ openssl rand -out /dev/shm/sample.txt -base64 $(( 2*2**30 * 3/4 ))
backup@server ~ $ scp -P 23 /dev/shm/sample.txt u377394@backup.mhnnet.de:/home/sample.txt
sample.txt 100% 2080MB 45.7MB/s 00:45
Here we see 45.7 MB/s instead of the 9.8 MBit/s I get while Borg Backup is running; the Borg transfer is about 37 times slower (note bytes vs. bits). Wireshark also shows me a nice 400 MBit/s (4x10^8 bit/s) during the scp transfer, which is actually the theoretical maximum between Netcup and the other data center in Finland.
[Attached screenshot: Screenshot_20231217_172721.png - Wireshark throughput during the scp transfer]
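The next thing I plan to try is pointing the exact same borg create at a throwaway local repository, to separate Borg's own read/chunk/hash path from the SSH transport (a sketch; the repository location and the disabled encryption are just my choices for this test):

borg init --encryption=none /var/tmp/borg-test-repo
time borg create --compression none --stats /var/tmp/borg-test-repo::test /var/www
rm -rf /var/tmp/borg-test-repo    # throw the test repository away afterwards

If this local run is fast, the bottleneck is somewhere in the SSH/remote path; if it is just as slow, the problem is on my server.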
So what makes Borg Backup so incredibly slow? Does anybody have any clever ideas about what to investigate next?