Tips and Recommendations for Storage Server Tuning
Here are some tips and recommendations on how to improve the performance of your storage servers. As usual, the optimal settings depend on your particular hardware and usage scenarios, so you should use these settings only as a starting point for your tuning efforts.
Note: Some of the settings suggested here are non-persistent and will be reverted after the next reboot. To keep them permanently, you could add the corresponding commands to /etc/rc.local, use /etc/sysctl.conf, or create udev rules to reapply them automatically when the machine boots.
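For example, the block device tunables shown further below could be reapplied automatically at boot with a udev rule along these lines (the rule file name, the device match pattern, and the values are only a sketch and need to be adapted to your devices and preferred settings):
# /etc/udev/rules.d/99-storage-tuning.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="4096", ATTR{queue/read_ahead_kb}="4096"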
Partition Alignment & RAID Settings of Local File System
To get the maximum performance out of your RAID arrays and SSDs, it is important to set the partition offset according to the native alignment. See here for a walk-through about partition alignment and creation of a RAID-optimized local file system: Partition Alignment Guide
A very simple and commonly used way to avoid the challenges of partition alignment altogether is to skip partitioning and create the file system directly on the device, e.g.:
$ mkfs.xfs /dev/sdX
Storage Server Throughput Tuning
In general, BeeGFS can be used with any of the standard Linux file systems.
Using XFS for your storage server data partition is generally recommended, because it scales very well for RAID arrays and typically delivers higher sustained write throughput on fast storage than alternative file systems. (There have also been significant improvements to ext4 streaming performance in recent Linux kernel versions.)
However, the default Linux kernel settings are rather optimized for single disk scenarios with low IO concurrency, so there are various settings that need to be tuned to get the maximum performance out of your storage servers.
Formatting Options
Make sure to enable RAID optimizations of the underlying file system, as described in the last section here: Create RAID-optimized File System
While BeeGFS uses dedicated metadata servers to manage global metadata, the metadata performance of the underlying file system on storage servers still matters for operations like file creates, deletes, small reads/writes, etc. Recent versions of XFS (similar work in progress for ext4) allow inlining of data into inodes to avoid the need for additional blocks and the corresponding expensive extra disk seeks for directories. In order to use this efficiently, the inode size should be increased to 512 bytes or larger.
Example: mkfs for XFS with larger inodes on 8 disks (where the number 8 does not include the number of RAID-5 or RAID-6 parity disks) and 128KB chunk size:
$ mkfs.xfs -d su=128k,sw=8 -l version=2,su=128k -isize=512 /dev/sdX
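Once the file system is mounted, you can verify the stripe unit and stripe width that mkfs recorded (reported as sunit/swidth, in file system blocks):
$ xfs_info <mountpoint>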
Mount Options
Enabling last file access times is inefficient, because the file system then has to write a timestamp update to disk even though the user only read file contents, or even when the contents were already cached in memory and no disk access would have been necessary at all. (Note: Recent Linux kernels have switched to a "relative atime" mode, so setting noatime might not be necessary in those cases.)
If your users don't need last access times, you should disable them by adding "noatime" to your mount options.
Increasing the number and size of the log buffers by adding the logbufs and logbsize mount options allows XFS to handle and enqueue pending file and directory operations more efficiently.
There are also several mount options for XFS that are intended to further optimize streaming performance on RAID storage, such as largeio, inode64, and swalloc.
If you are using XFS and aiming for optimal streaming write throughput, you might also want to add the mount option allocsize=131072k to reduce the risk of fragmentation for large files.
If your RAID controller has a battery-backup-unit (BBU), adding the mount option nobarrier for XFS or ext4 can significantly increase throughput.
Example: Typical XFS mount options for a BeeGFS storage server with a RAID controller battery:
$ mount -onoatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier /dev/sdX <mountpoint>
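To have these options applied automatically at every mount, a matching /etc/fstab entry could look roughly like the following sketch (the device path and the mount point /data/beegfs-storage are placeholders for your actual setup):
/dev/sdX  /data/beegfs-storage  xfs  noatime,nodiratime,logbufs=8,logbsize=256k,largeio,inode64,swalloc,allocsize=131072k,nobarrier  0 0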
IO Scheduler
First, set an appropriate IO scheduler for file servers (on recent kernels that use blk-mq, the corresponding scheduler is called mq-deadline):
$ echo deadline > /sys/block/sdX/queue/scheduler
Now give the IO scheduler more flexibility by increasing the number of schedulable requests:
$ echo 4096 > /sys/block/sdX/queue/nr_requests
To improve throughput for sequential reads, increase the maximum amount of read-ahead data. The actual amount of read-ahead is adaptive, so using a high value here won't harm performance for small random access.
$ echo 4096 > /sys/block/sdX/queue/read_ahead_kb
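If the server has several storage block devices, the same values can be applied to all of them in one go, e.g. with a small loop (this assumes the relevant devices match /dev/sd*; adjust the pattern to your environment):
$ for q in /sys/block/sd*/queue; do echo deadline > $q/scheduler; echo 4096 > $q/nr_requests; echo 4096 > $q/read_ahead_kb; done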
Virtual memory settings
To avoid long IO stalls (latencies) for write cache flushing in a production environment with very different workloads, you will typically want to limit the kernel dirty (write) cache size:
$ echo 5 > /proc/sys/vm/dirty_background_ratio
$ echo 10 > /proc/sys/vm/dirty_ratio
Only for special use-cases: If you are going for optimal sustained streaming performance, you may instead want to use different settings that start asynchronous writes of data very early and allow the major part of the RAM to be used for write caching. (For generic use-cases, use the settings described above instead.)
$ echo 1 > /proc/sys/vm/dirty_background_ratio
$ echo 75 > /proc/sys/vm/dirty_ratio
Assigning slightly higher priority to inode caching helps to avoid disk seeks for inode loading:
$ echo 50 > /proc/sys/vm/vfs_cache_pressure
Buffering of file system data requires frequent memory allocation. Raising the amount of reserved kernel memory will enable faster and more reliable memory allocation in critical situations. Raise the corresponding value to 64MB if you have less than 8GB of memory, otherwise raise it to at least 256MB:
$ echo 262144 > /proc/sys/vm/min_free_kbytes
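To keep these virtual memory settings across reboots, the equivalent entries could also go into /etc/sysctl.conf. This sketch uses the generic-workload values from above; adjust vm.min_free_kbytes to your RAM size:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 262144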
Transparent huge pages can cause performance degradation under high load, due to the frequent change of file system cache memory areas. For RedHat 6.x and derivatives, it is recommended to disable transparent huge pages support, unless huge pages are explicitly requested by an application:
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
$ echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag
With recent mainline kernel versions:
$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
$ echo madvise > /sys/kernel/mm/transparent_hugepage/defrag
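The currently active mode is shown in square brackets and can be checked with:
$ cat /sys/kernel/mm/transparent_hugepage/enabled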
Controller Settings
Optimal performance for hardware RAID systems often depends on large IOs being sent to the device in a single large operation. Please refer to your hardware storage vendor for the corresponding optimal size of /sys/block/sdX/max_sectors_kb. It is typically good if this size can be increased to at least match your RAID stripe set size (i.e. chunk_size x number_of_disks). For example, with the 128KB chunk size and 8 data disks from the mkfs example above, the full stripe is 128KB x 8 = 1024KB:
$ echo 1024 > /sys/block/sdX/queue/max_sectors_kb
Furthermore, high values of sg_tablesize (/sys/class/scsi_host/*/sg_tablesize) are recommended to allow large IOs. Those values depend on controller firmware versions, kernel versions and driver settings.
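The values currently in effect on your controllers can be inspected with:
$ cat /sys/class/scsi_host/*/sg_tablesize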
System BIOS & Power Saving
To allow the Linux kernel to correctly detect the system properties and enable corresponding optimizations (e.g. for NUMA systems), it is very important to keep your system BIOS updated.
The dynamic CPU clock frequency scaling feature for power saving, which is often enabled by default, has a high impact on latency. Thus, it is recommended to turn off dynamic CPU frequency scaling. Ideally, this is done in the machine BIOS (see Intel SpeedStep or AMD PowerNow), but it can also be done at runtime, e.g. via:
$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor >/dev/null
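If the cpupower utility is available on the system, the governor can likewise be set for all CPUs with the following command (this assumes the cpufreq driver exposes a performance governor, as in the sysfs command above):
$ cpupower frequency-set -g performance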