Tips and Recommendations for Storage Server Tuning
Here are some tips and recommendations on how to improve the performance
of your storage servers. As usual, the optimal settings depend on your
particular hardware and usage scenarios, so you should use these
settings only as a starting point for your tuning efforts.
Note: Some of the settings suggested here are non-persistent and will be reverted after the next reboot. To keep them permanently, you could add the corresponding commands to /etc/rc.local, use /etc/sysctl.conf, or create udev rules to reapply them automatically when the machine boots.
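For example, a minimal /etc/rc.local sketch that reapplies one of the non-persistent settings from the IO scheduler section below could look like this (sdX is a placeholder for your actual device):
$ cat /etc/rc.local
#!/bin/bash
# Reapply non-persistent tuning settings after boot
echo deadline > /sys/block/sdX/queue/scheduler
exit 0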
Partition Alignment & RAID Settings of Local File System
To get the maximum performance out of your RAID arrays and SSDs, it is important to set the partition offset according to the
native alignment. See the following guide for a walk-through of partition alignment and the creation of a
RAID-optimized local file system:
Partition Alignment Guide
A simple and commonly used way to avoid alignment problems altogether is to skip partitioning
and create the file system directly on the device, e.g.:
$ mkfs.xfs /dev/sdX
Storage Server Throughput Tuning
In general, BeeGFS can be used with any of the standard Linux file systems.
Using
XFS for your storage server data partition is
generally recommended, because it scales very well for RAID arrays and
typically delivers a higher sustained write throughput on fast storage,
compared to alternative file systems. (There have also been significant
improvements to ext4 streaming performance in recent Linux kernel
versions.)
However, the default Linux kernel settings are optimized for
single-disk scenarios with low IO concurrency, so various
settings need to be tuned to get the maximum performance out of
your storage servers.
Formatting Options
Make sure to enable the
RAID optimizations of the underlying file system, as described in the last section of the following guide:
Create RAID-optimized File System
While BeeGFS uses dedicated metadata servers to manage global metadata, the
metadata performance
of the underlying file system on storage servers still matters for
operations like file creates, deletes, small reads/writes, etc. Recent
versions of XFS (similar work in progress for ext4) allow inlining of
data into inodes to avoid the need for additional blocks and the
corresponding expensive extra disk seeks for directories. In order to
use this efficiently, the inode size should be increased to 512 bytes or
larger.
Example: mkfs for XFS with larger inodes on 8
disks (where the number 8 does not include RAID-5 or
RAID-6 parity disks) and a 128KB chunk size:
$ mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -i size=512 /dev/sdX
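After formatting and mounting, you can verify that the stripe geometry (sunit/swidth) and the inode size (isize=512) were applied as intended; this is just a sanity check, and the exact output format depends on your xfsprogs version:
$ xfs_info <mountpoint>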
Mount Options
Enabling last file access times is inefficient, because
it forces the file system to write an updated timestamp to disk
even though the user only read file contents, and even when the
contents were already cached in memory and no disk access would
have been necessary at all. (Note: Recent Linux kernels have
switched to a "relative atime" mode by default, so setting
noatime might not be strictly necessary in these cases.)
If your users don't need last access times, you should disable them by adding
"noatime" to your mount options.
Increasing the number and size of the log buffers by adding the
logbufs and
logbsize mount options allows XFS to handle and enqueue pending file and directory operations more efficiently.
There are also several mount options for XFS that are intended to further
optimize streaming performance on RAID storage, such as
largeio,
inode64, and
swalloc.
If you are using XFS and want to go for optimal
streaming write throughput, you might also want to add the mount option
allocsize=131072k to reduce the risk of fragmentation for large files.
If your RAID controller has a battery-backup-unit (
BBU), adding the mount option
nobarrier for XFS or ext4 can significantly increase throughput.
Example: Typical XFS mount options for a BeeGFS storage server with a battery-backed RAID controller:
$ mount -o noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
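To keep these mount options across reboots, the same options could be added to /etc/fstab, for example as follows (the mount point /data/beegfs_storage is only an illustrative placeholder; use your actual storage target path):
$ cat /etc/fstab
/dev/sdX  /data/beegfs_storage  xfs  noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier  0  0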
IO Scheduler
First, set an
appropriate IO scheduler for file servers:
$ echo deadline > /sys/block/sdX/queue/scheduler
Now give the IO scheduler more flexibility by increasing the
number of schedulable requests:
$ echo 4096 > /sys/block/sdX/queue/nr_requests
To improve
throughput for sequential reads, increase the maximum amount of read-ahead data. The actual amount of read-ahead is
adaptive, so using a high value here won't harm performance for small random access.
$ echo 4096 > /sys/block/sdX/queue/read_ahead_kb
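As a sketch of the udev rule approach mentioned in the note at the top, the scheduler, request queue depth and read-ahead settings could be reapplied automatically at boot; the rule file name is arbitrary, and you should restrict the KERNEL match if not all sd devices are storage targets:
$ cat /etc/udev/rules.d/99-beegfs-storage-tuning.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="4096", ATTR{queue/read_ahead_kb}="4096"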
Virtual Memory Settings
To
avoid long IO stalls (latencies) for write cache flushing in a production environment with very
different workloads, you will typically want to limit the kernel dirty (write) cache size:
$ echo 5 > /proc/sys/vm/dirty_background_ratio
$ echo 10 > /proc/sys/vm/dirty_ratio
Only for special use-cases: If you are going for
optimal sustained streaming performance, you may instead want to use
different settings that start asynchronous writes of data very early and
allow most of the RAM to be used for write caching. (For
generic use-cases, use the settings described above instead.)
$ echo 1 > /proc/sys/vm/dirty_background_ratio
$ echo 75 > /proc/sys/vm/dirty_ratio
Assigning slightly
higher priority to inode caching helps to avoid disk seeks for inode loading:
$ echo 50 > /proc/sys/vm/vfs_cache_pressure
Buffering of file system data requires frequent memory allocation. Raising the amount of
reserved kernel memory
will enable faster and more reliable memory allocation in critical
situations. Raise the corresponding value to 64MB if you have less than
8GB of memory, otherwise raise it to at least 256MB:
$ echo 262144 > /proc/sys/vm/min_free_kbytes
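The virtual memory settings above can be made persistent via /etc/sysctl.conf, shown here with the generic production values (adjust them if you chose the streaming-optimized variant):
$ cat /etc/sysctl.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 262144
The values can then be applied without a reboot via:
$ sysctl -p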
Transparent huge pages can cause performance degradation
under high load, due to the frequent change of file system cache memory
areas. For RedHat 6.x and derivatives, it is recommended to
disable transparent huge pages support, unless huge pages are explicitly requested by an application:
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag
With recent mainline kernel versions:
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
$ echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
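You can check which mode is currently active; the value in square brackets is the one in effect, e.g.:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never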
Controller Settings
Optimal performance for hardware RAID systems often depends on
large IOs being sent to the device in a
single large operation. Please refer to your hardware storage vendor for the corresponding optimal size of
/sys/block/sdX/queue/max_sectors_kb.
It is typically good if this size can be increased to at least match
your RAID stripe set size (i.e. chunk_size x number_of_disks):
$ echo 1024 > /sys/block/sdX/queue/max_sectors_kb
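Note that max_sectors_kb cannot be raised above the hardware limit reported by the driver, which can be checked beforehand:
$ cat /sys/block/sdX/queue/max_hw_sectors_kb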
Furthermore,
high values of sg_tablesize (
/sys/class/scsi_host/*/sg_tablesize) are recommended to allow
large IOs. Those values depend on controller firmware versions, kernel versions and driver settings.
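To inspect the current values on your hosts:
$ grep . /sys/class/scsi_host/*/sg_tablesize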
System BIOS & Power Saving
To allow the Linux kernel to correctly detect the system properties and
enable corresponding optimizations (e.g. for NUMA systems), it is very
important to
keep your system BIOS updated.
The
dynamic CPU clock frequency scaling feature for
power saving, which is often enabled by default, has a high impact on
latency. Thus, it is recommended to turn off dynamic CPU frequency
scaling. Ideally, this is done in the machine BIOS (see
Intel SpeedStep or
AMD PowerNow), but it can also be done at runtime, e.g. via:
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
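Afterwards, you can verify that the governor is active on all cores:
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor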