Feedback on UFS2 tuning for large number of small files (~100m)
Ciprian Dorin Craciun
2016-06-08 08:14:40 UTC
Hello all! (Please keep me in CC as I'm not subscribed on the mailing
list. Should I perhaps post this to the `freebsd-fs` mailing list?)

I would like your feedback on tuning a UFS2 file-system for the
following use-case, which is very similar to a maildir mail server. I
tried to look for hints on the internet, but found nothing more
in-depth than enabling soft-updates, `noatime`, etc.

The main usage of the file-system is:

* there are 4 separate file stores, each with about 50 million files,
all on the same partition;
* all 4 file stores have a dispersed layout on two levels (i.e.
`XX/YY/ZZ...`, where `ZZ...` is a 64-character hexadecimal string); (as a
consequence there shouldn't be more than one thousand files per leaf
folder);
* all of the files above are around 2-3 KiB;
* these files are read-mostly, and they are never deleted;
* there is almost no access contention, neither read nor write;
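
As a rough check of the layout described above, the following sketch maps a 64-character hexadecimal name to an `XX/YY/ZZ...` path and estimates the average number of files per leaf folder. It assumes that `XX` and `YY` are the first two pairs of hex digits of the name and that each level therefore has 256 sub-folders; the post does not state this, and the function names are hypothetical.

```python
def store_path(name: str) -> str:
    """Map a 64-character hex name to a two-level dispersed path XX/YY/ZZ...

    Assumption: XX and YY are the first and second pairs of hex digits.
    """
    if len(name) != 64 or any(c not in "0123456789abcdef" for c in name):
        raise ValueError("expected 64 lowercase hexadecimal characters")
    return f"{name[0:2]}/{name[2:4]}/{name}"


def files_per_leaf(total_files: int, levels: int = 2) -> float:
    """Average files per leaf folder when each level has 256 sub-folders."""
    return total_files / (256 ** levels)


if __name__ == "__main__":
    print(store_path("ab" * 32))
    # 50 million files per store spread over 256 * 256 leaf folders
    print(round(files_per_leaf(50_000_000)))  # 763
```

Under these assumptions the average comes to roughly 763 files per leaf folder, which is consistent with the "no more than one thousand" estimate.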

* there are 4 matching "queue" stores, dispersed on a single level,
containing symlinks;
* each symlink points to a path roughly 100-200 characters in length;
* I wouldn't expect more than a few thousand files for each store;
* the symlinks are constantly `rename`-d into and out of these
folders;
* these folders are constantly listed, by 4-32 parallel processes (not
* (basically I use the stores to emulate a queuing system, and I'm careful
that each process tries the leaf folders in random order, thus reducing
contention, and pauses if the queue "seems" empty;)
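
The queue emulation described above can be sketched roughly as follows. This is only an illustration of claiming an entry via `rename`, not the poster's actual code: the `claim_one` function and the `claimed_dir` folder are hypothetical, and the sketch shuffles the entries of a single folder rather than choosing among several folders.

```python
import os
import random


def claim_one(queue_dir: str, claimed_dir: str):
    """Try to claim one entry from queue_dir by renaming it into claimed_dir.

    Returns the new path of the claimed entry, or None if nothing was claimed.
    Both folders must be on the same file-system so that rename is atomic.
    """
    entries = os.listdir(queue_dir)
    random.shuffle(entries)  # random order, to reduce contention between workers
    for name in entries:
        src = os.path.join(queue_dir, name)
        dst = os.path.join(claimed_dir, name)
        try:
            os.rename(src, dst)  # renames the symlink itself, not its target
        except FileNotFoundError:
            continue  # another worker renamed it first; try the next entry
        return dst
    return None
```

A worker would call `claim_one` in a loop and sleep when it returns `None`. Note that on POSIX systems `os.rename` silently replaces an existing destination, so the sketch assumes entry names are unique.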

As sidenotes:

* the partition is backed by two mirrored disks (which I'm assuming
are rotating SCSI disks);
* persistence in case of power or system failure (i.e. files getting
truncated or missing) is not so critical for my use-case;
* however file-system consistency on failure (i.e. getting a correctly
mounted file-system) is important, thus from what I've read in the
`mount` man-page, `async` is not an option;
* the system has plenty of RAM (32 GiB), however it is constantly
under 100% CPU load by processes on nice level 10;
* this system is dedicated to the task at hand, therefore there is no
other background contention;

The problem that prompted me to ask the community for feedback is that
under load (i.e. 100% CPU usage by processes on nice level 10), even
listing the file-system seems to stall, ranging from a fraction of
second up to a few seconds.

The output of `iostat -w 30 -d -C -x -I` under load is (the values are
totals for each 30-second interval, not per-second averages):
device        r/i        w/i       kr/i         kw/i  qlen  tsvc_t/i    sb/i  us ni sy in  id
ada0    1243893.0  4988740.0  6447101.5  311428382.5   600  812579.1  8698.9   0  0  0  0 100
ada1    1243889.0  4988824.0  6429851.0  311428550.5   520  766389.6  8437.3

device        r/i        w/i       kr/i         kw/i  qlen  tsvc_t/i    sb/i  us ni sy in  id
ada0        582.0    12510.0     2328.0     152986.5   383    9463.4    28.9   0  3  1  0  96
ada1        587.0    12465.0     2348.0     152806.5   343    9107.8    28.7

device        r/i        w/i       kr/i         kw/i  qlen  tsvc_t/i    sb/i  us ni sy in  id
ada0        792.0    12933.0     3168.0     157643.5   542   11178.8    29.1   0  3  1  0  96
ada1        791.0    12893.0     3164.0     157651.5   544   10591.2    28.5
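
Because the figures above are per-interval totals, it may help to convert them to per-second rates and average request sizes. The sketch below does this for the `ada0` row of the second sample; the helper names are hypothetical, and the 30-second interval is taken from the `-w 30` option.

```python
INTERVAL_S = 30  # from `iostat -w 30`


def per_second(total: float, interval_s: float = INTERVAL_S) -> float:
    """Convert a per-interval total into a per-second rate."""
    return total / interval_s


def avg_kib_per_op(kib_total: float, ops_total: float) -> float:
    """Average transfer size in KiB per operation."""
    return kib_total / ops_total


# ada0, second sample (values copied from the output above)
reads, kib_read = 582.0, 2328.0
writes, kib_written = 12510.0, 152986.5

print(f"reads/s:       {per_second(reads):.1f}")
print(f"writes/s:      {per_second(writes):.1f}")
print(f"KiB per read:  {avg_kib_per_op(kib_read, reads):.2f}")
print(f"KiB per write: {avg_kib_per_op(kib_written, writes):.2f}")
```

For this sample the disk sees about 417 writes per second against roughly 19 reads per second, with reads averaging exactly 4 KiB and writes about 12.2 KiB.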

The file-system is mounted with the following options:
ufs rw,noatime

The `dumpefs` of the file-system outputs the following:
magic 19540119 (UFS2) time Sat Jun 4 05:59:23 2016
superblock location 65536 id [ 56cb7a3f 33fd7a56 ]
ncg 2897 size 464257019 blocks 449679279
bsize 32768 shift 15 mask 0xffff8000
fsize 4096 shift 12 mask 0xfffff000
frag 8 shift 3 fsbtodb 3
minfree 8% optim time symlinklen 120
maxbsize 32768 maxbpg 4096 maxcontig 4 contigsumsize 4
nbfree 56167793 ndir 265137 nifree 232205846 nffree 9111
bpg 20035 fpg 160280 ipg 80256 unrefs 0
nindir 4096 inopb 128 maxfilesize 2252349704110079
sbsize 4096 cgsize 32768 csaddr 5056 cssize 49152
sblkno 24 cblkno 32 iblkno 40 dblkno 5056
cgrotor 0 fmod 0 ronly 0 clean 0
metaspace 6408 avgfpdir 64 avgfilesize 16384
flags soft-updates+journal
fsmnt /some-path
volname swuid 0 providersize 464257019
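
A few derived figures can be read off the `dumpefs` output with simple arithmetic. The sketch below assumes that the total inode count is `ncg * ipg`, that `size` is counted in fragments of `fsize` bytes, and that each 2-3 KiB file occupies at least one 4096-byte fragment; the 4 x 50 million file count comes from the description earlier in the post.

```python
# Values copied from the dumpefs output above.
ncg, ipg, nifree = 2897, 80256, 232205846
size_frags, fsize = 464257019, 4096

total_inodes = ncg * ipg              # inodes allocated at newfs time
used_inodes = total_inodes - nifree   # inodes in use when dumpefs was run
fs_bytes = size_frags * fsize         # assumed: `size` is in fragments

planned_files = 4 * 50_000_000        # four stores of about 50 million files
min_data_bytes = planned_files * fsize  # at least one fragment per small file

print(f"total inodes:    {total_inodes}")
print(f"used inodes:     {used_inodes}")
print(f"inode headroom:  {total_inodes - planned_files}")
print(f"fs size (GiB):   {fs_bytes / 2**30:.0f}")
print(f"min data (GiB):  {min_data_bytes / 2**30:.0f}")
```

Under these assumptions the file-system has about 232.5 million inodes, of which only about 296 thousand were in use at the time of the dump, so 200 million files would leave roughly 32.5 million inodes spare.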

Thus I would like to ask the community what I can tune (even by
re-formatting) to make it more "responsive", and alternatively I am
open to another file-system type, perhaps more suited for this

Eduardo Morras via freebsd-questions
2016-06-08 10:53:25 UTC
On Wed, 8 Jun 2016 11:14:40 +0300
Post by Ciprian Dorin Craciun
Hello all! (Please keep me in CC as I'm not subscribed on the mailing
list. Should I perhaps post this to the `freebsd-fs` mailing list?)
I would like your feedback on tuning a UFS2 file-system for the
following use-case, which is very similar to a maildir mail server. I
tried to look for hints on the internet, but found nothing more
in-depth than enabling soft-updates, `noatime`, etc.
You can use tunefs to set:

a) the average file size and the expected number of files per directory,
b) check whether your filesystem has any ACL flags enabled (NFS, POSIX, whatever) and disable them if you don't use ACLs,
c) if the filesystem is full or nearly full, the system switches from the faster write strategy to the slower one that minimizes fragmentation; you can force the fast (time) optimization.

These require only an umount/mount, no reformatting.

Newfs can set those values when the filesystem is created.
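
As an illustration of points a) and c), the sketch below only builds a `tunefs` command line and prints it; it does not run anything. It relies on the `tunefs` options `-f` (average file size), `-s` (average files per directory) and `-o` (`time` or `space`). The device path and the numeric values are placeholders, not recommendations from this thread.

```python
import shlex


def tunefs_command(device: str, avg_file_size: int, avg_files_per_dir: int,
                   optimize: str = "time") -> list:
    """Build (but do not run) a tunefs command line for the given settings."""
    if optimize not in ("time", "space"):
        raise ValueError("optimize must be 'time' or 'space'")
    return [
        "tunefs",
        "-f", str(avg_file_size),      # expected average file size, in bytes
        "-s", str(avg_files_per_dir),  # expected files per directory
        "-o", optimize,                # allocation optimization preference
        device,
    ]


# Placeholder device and values, for illustration only.
print(shlex.join(tunefs_command("/dev/EXAMPLE", 4096, 1024)))
```

The printed command would have to be run against an unmounted (or read-only mounted) filesystem, in line with the umount/mount requirement mentioned above.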

There are some sysctls you can tweak, again without reformatting:

vfs.ufs.dirhash_maxmem: maximum allowed dirhash memory usage

Default value is 6MB. First check dirhash usage with

# sysctl vfs.ufs.dirhash_mem

and if it's equal or close to maxmem, grow it (24MB or more, for example)
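
The check described above could be scripted along these lines. This is a minimal sketch: the helper names and the 90% threshold for "close to maxmem" are assumptions, and `check_dirhash` only works on a system that provides the `vfs.ufs.dirhash_mem` and `vfs.ufs.dirhash_maxmem` sysctls.

```python
import subprocess


def read_sysctl_int(name: str) -> int:
    """Read an integer sysctl value using `sysctl -n <name>`."""
    result = subprocess.run(["sysctl", "-n", name], check=True,
                            capture_output=True, text=True)
    return int(result.stdout.strip())


def dirhash_near_limit(mem: int, maxmem: int, threshold: float = 0.9) -> bool:
    """True when dirhash memory use is at or above threshold * maxmem."""
    return mem >= threshold * maxmem


def check_dirhash() -> bool:
    """Compare current dirhash usage with its limit on the running system."""
    mem = read_sysctl_int("vfs.ufs.dirhash_mem")
    maxmem = read_sysctl_int("vfs.ufs.dirhash_maxmem")
    return dirhash_near_limit(mem, maxmem)
```

If `check_dirhash()` returns `True`, raising `vfs.ufs.dirhash_maxmem` as suggested above would be the next thing to try.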

Tuning at the ffs level is trickier and riskier, check sysctl vfs.ffs.*

--- ---
Eduardo Morras <***@yahoo.es>
Eduardo Morras via freebsd-questions
2016-06-08 11:15:52 UTC
On Wed, 8 Jun 2016 12:53:25 +0200
Eduardo Morras via freebsd-questions <freebsd-***@freebsd.org>

I forgot to say that systat gives better explained statistics than iostat, run:

% systat -vmstat [number_seconds_refresh]

and the info you need is from "Namei" to the right and below (3rd quadrant).

--- ---
Eduardo Morras <***@yahoo.es>
Arthur Chance
2016-06-08 13:58:54 UTC
Post by Eduardo Morras via freebsd-questions
On Wed, 8 Jun 2016 11:14:40 +0300
Post by Ciprian Dorin Craciun
Hello all! (Please keep me in CC as I'm not subscribed on the mailing
list. Should I perhaps post this to the `freebsd-fs` mailing list?)
I would like your feedback on tuning a UFS2 file-system for the
following use-case, which is very similar to a maildir mail server. I
tried to look for hints on the internet, but found nothing more
in-depth than enabling soft-updates, `noatime`, etc.
a) average file size and expected number of files per directory,
b) check if your filesystem has any ACL flags on (NFS, POSIX, whatever) and disable them if you don't use ACLs,
c) if filesystem is full or near full, system switch from faster writes to minimize fragmentation strategy (slower), force the fast access optimization.
They requires a umount/mount only no reformating.
Newfs can set those values at fs formatting/createing.
vfs.ufs.dirhash_maxmem: maximum allowed dirhash memory usage
Minor point: this looks like it's dependent on the memory size. On my
32GB machine it's 27.8 MB, on my 4GB machine it's 6.5 MB, and on a 2GB
machine it's 3.3 MB.
Post by Eduardo Morras via freebsd-questions
Defaul value is 6MB. Check first dirhash usage with
#sysctl vfs.ufs.dirhash_mem
and if it's equal or close to maxmem, grow it (24MB or more, f.ex.)
Tuning at ffs level are more tricky and risky, check sysctl vfs.ffs.*
Moore's Law of Mad Science: Every eighteen months, the minimum IQ
necessary to destroy the world drops by one point.
Eduardo Morras via freebsd-questions
2016-06-08 15:19:09 UTC
On Wed, 8 Jun 2016 14:58:54 +0100
Post by Arthur Chance
Post by Eduardo Morras via freebsd-questions
On Wed, 8 Jun 2016 11:14:40 +0300
Post by Ciprian Dorin Craciun
Hello all! (Please keep me in CC as I'm not subscribed on the
mailing list. Should I perhaps post this to the `freebsd-fs`
mailing list?)
vfs.ufs.dirhash_maxmem: maximum allowed dirhash memory usage
Minor point: this looks like it's dependent on the memory size. On my
32GB machine it's 27.8 MB, on my 4GB machine it's 6.5 MB, and on a 2GB
machine it's 3.3 MB.
Yes, you are right. I don't know when it changed; initially it was a static 2MB, and I thought (wrongly) that it was still a static value.

My fault.
--- ---
Eduardo Morras <***@yahoo.es>
