Lustre
This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by
intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate,
broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering,
disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us
in writing.
If this is software or related software documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the
following notice is applicable:
U.S. GOVERNMENT RIGHTS. Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers
are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific
supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set
forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR
52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.
This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any
inherently dangerous applications, including applications which may create a risk of personal injury. If you use this software or hardware in dangerous
applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle
Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.
AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. Intel and Intel Xeon are
trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of
SPARC International, Inc. UNIX is a registered trademark licensed through X/Open Company, Ltd.
This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle
Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and
services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party
content, products, or services.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license and obtain more
information about Creative Commons licensing, visit Creative Commons Attribution-Share Alike 3.0 United States or send a letter to Creative
Commons, 171 2nd Street, Suite 300, San Francisco, California 94105, USA.
Contents
Preface xxi
3.1 What is Failover? 3–2
3.1.1 Failover Capabilities 3–2
3.1.2 Types of Failover Configurations 3–3
3.2 Failover Functionality in Lustre 3–4
3.2.1 MDT Failover Configuration (Active/Passive) 3–5
3.2.2 OST Failover Configuration (Active/Active) 3–5
8.1.2 Environmental Requirements 8–5
8.2 Lustre Installation Procedure 8–6
14. Lustre Maintenance 14–1
14.1 Working with Inactive OSTs 14–2
14.2 Finding Nodes in the Lustre File System 14–2
14.3 Mounting a Server Without Lustre Service 14–3
14.4 Regenerating Lustre Configuration Logs 14–4
14.5 Changing a Server NID 14–5
14.6 Adding a New OST to a Lustre File System 14–7
14.7 Removing and Restoring OSTs 14–8
14.7.1 Removing an OST from the File System 14–8
14.7.2 Backing Up OST Configuration Files 14–10
14.7.3 Restoring OST Configuration Files 14–11
14.7.4 Returning a Deactivated OST to Service 14–12
14.8 Aborting Recovery 14–12
14.9 Determining Which Machine is Serving an OST 14–13
14.10 Changing the Address of a Failover Node 14–13
18.4 Retrieving File Layout/Striping Information (getstripe) 18–8
18.4.1 Displaying the Current Stripe Size 18–8
18.4.2 Inspecting the File Tree 18–8
18.5 Managing Free Space 18–9
18.5.1 Checking File System Free Space 18–9
18.5.2 Using Stripe Allocations 18–10
18.5.3 Adjusting the Weighting Between Free Space and Location 18–11
23.2.3 Defining and Running the Tests 23–5
23.2.4 Sample Script 23–6
23.3 LNET Self-Test Command Reference 23–7
23.3.1 Session Commands 23–7
23.3.2 Group Commands 23–8
23.3.3 Batch and Test Commands 23–11
23.3.4 Other Commands 23–15
26.3.14 Setting SCSI I/O Sizes 26–16
Part VI Reference
30.2.6 Request Replay 30–9
30.2.7 Gaps in the Replay Sequence 30–9
30.2.8 Lock Recovery 30–10
30.2.9 Request Resend 30–10
30.3 Reply Reconstruction 30–11
30.3.1 Required State 30–11
30.3.2 Reconstruction of Open Replies 30–11
30.4 Version-based Recovery 30–13
30.4.1 VBR Messages 30–14
30.4.2 Tips for Using VBR 30–14
30.5 Commit on Share 30–15
30.5.1 Working with Commit on Share 30–15
30.5.2 Tuning Commit On Share 30–16
33.1.3 Parameters 33–3
33.1.4 Data Structures 33–4
33.2 l_getgroups Utility 33–4
Glossary Glossary–1
Index Index–1
xx Lustre 2.0 Operations Manual • January 2011
Preface
UNIX Commands
This document might not contain information about basic UNIX commands and
procedures such as shutting down the system, booting the system, and configuring
devices. Refer to the following for this information:
■ Software documentation that you received with your system
■ Oracle Solaris Operating System documentation, which is at:
http://docs.sun.com
Shell Prompts
Shell                                     Prompt
C shell                                   machine-name%
C shell superuser                         machine-name#
Bourne shell and Korn shell               $
Bourne shell and Korn shell superuser     #
Related Documentation
The documents listed as online are available at:
http://docs.sun.com/app/docs/prod/lustre.fs20?l=en&a=view
Lustre 2.0 Operations Manual 821-2076-10 July 2010 First release of Lustre 2.0 manual
Lustre 2.0 Operations Manual 821-2076-10 January 2011 Second release of Lustre 2.0 manual
PART I Introducing Lustre
Understanding Lustre
This chapter describes the Lustre architecture and features of Lustre. It includes the
following sections:
■ What Lustre Is (and What It Isn’t)
■ Lustre Components
■ Lustre Storage and I/O
1.1 What Lustre Is (and What It Isn’t)
Lustre is a storage architecture for clusters. The central component of the Lustre
architecture is the Lustre file system, which is supported on the Linux operating
system and provides a POSIX-compliant UNIX file system interface.
The Lustre storage architecture is used for many different kinds of clusters. It is best
known for powering seven of the ten largest high-performance computing (HPC)
clusters worldwide, with tens of thousands of client systems, petabytes (PB) of
storage and hundreds of gigabytes per second (GB/sec) of I/O throughput. Many
HPC sites use Lustre as a site-wide global file system, serving dozens of clusters.
The ability of a Lustre file system to scale capacity and performance for any need
reduces the need to deploy many separate file systems, such as one for each compute
cluster. Storage management is simplified by avoiding the need to copy data
between compute clusters. In addition to aggregating storage capacity of many
servers, the I/O throughput is also aggregated and scales with additional servers.
Moreover, throughput and/or capacity can be easily increased by adding servers
dynamically.
While Lustre can function in many work environments, it is not necessarily the best
choice for all applications. It is best suited for uses that exceed the capacity that a
single server can provide, though in some use cases Lustre can perform better with a
single server than other filesystems due to its strong locking and data coherency.
Lustre is currently not particularly well suited for "peer-to-peer" usage models where
there are clients and servers running on the same node, each sharing a small amount
of storage, due to the lack of Lustre-level data replication. In such uses, if one
client/server fails, then the data stored on that node will not be accessible until the
node is restarted.
A Lustre installation can be scaled up or down with respect to the number of client
nodes, disk storage and bandwidth. Scalability and performance are dependent on
available disk and network bandwidth and the processing power of the servers in the
system. Lustre can be deployed in a wide variety of configurations that can be scaled
well beyond the size and performance observed in production systems to date.
TABLE 1-1 shows the practical range of scalability and performance characteristics of
the Lustre file system and some test results in production systems.
It is preferable that the MGS have its own storage space so that it can be managed
independently. However, the MGS can be “co-located” and share storage space with
an MDS as shown in FIGURE 1-1.
TABLE 1-2 provides the requirements for attached storage for each Lustre file system
component and describes desirable characteristics of the hardware used.
In FIGURE 1-3, each filename points to an inode. The inode contains all of the file
attributes, such as owner, access permissions, Lustre striping layout, access time, and
access control. Multiple filenames may point to the same inode.
FIGURE 1-3 MDT file points to objects on OSTs containing file data
Each file on the MDT contains the layout of the associated data file, including the
OST number and object identifier. Clients request the file layout from the MDS and
then perform file I/O operations by communicating directly with the OSSs that
manage that file data.
Each object contains a chunk of data from the file. When the chunk of data being
written to a particular object exceeds the stripe_size, the next chunk of data in the
file is stored on the next object.
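The chunk placement described above can be sketched in a few lines of Python. This is an illustrative model only, not Lustre code: the function name locate is hypothetical, and the sketch assumes that chunks are placed on objects in round-robin order, wrapping back to the first object after the last one.

```python
MB = 1 << 20  # bytes in 1 MB (binary), matching the 1MB default stripe_size


def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (object index, byte offset inside that object).

    Assumption: chunks of stripe_size bytes are assigned to objects
    round-robin, which is how the text describes "the next object".
    """
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe_size and stripe_count must be > 0")
    chunk = offset // stripe_size      # index of the stripe_size-sized chunk in the file
    obj = chunk % stripe_count         # object that holds this chunk
    rounds = chunk // stripe_count     # complete passes over all objects before this chunk
    return obj, rounds * stripe_size + offset % stripe_size


if __name__ == "__main__":
    # A file striped over three objects with a 1 MB stripe_size.
    for off in (0, MB, 2 * MB, 3 * MB + 5):
        print(off, locate(off, MB, 3))
```

With a stripe_count of 1, every offset maps to object 0 and the object offset equals the file offset, which matches the single-object case.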
Default values for stripe_count and stripe_size are set for the file system. The
default value for stripe_count is 1 stripe per file and the default value for
stripe_size is 1MB. The user may change these values on a per directory or per file
basis. For more details, see Section 18.3, “Setting the File Layout/Striping
Configuration (lfs setstripe)” on page 18-4.
In FIGURE 1-5, the stripe_size for File C is larger than the stripe_size for File A,
allowing more data to be stored in a single stripe for File C. The stripe_count for
File A is 3, resulting in data striped across three objects, while the stripe_count for
File B and File C is 1.
No space is reserved on the OST for unwritten data. File A in FIGURE 1-5 is a sparse
file that is missing chunk 6.
The maximum file size is not limited by the size of a single target. Lustre can stripe
files across multiple objects (up to 160), and each object can be up to 2 TB in size. This
leads to a maximum file size of 320 TB. (Note that Lustre itself can support files up to
2^64 bytes depending on the backing storage used by OSTs.)
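The arithmetic behind the 320 TB figure is simply the object limit multiplied by the stripe limit. A minimal Python sketch, using the two limits stated above (the constant and function names are illustrative, not Lustre identifiers):

```python
TB = 1 << 40                 # bytes in 1 TB (binary)
MAX_STRIPE_COUNT = 160       # maximum objects per file, from the text
MAX_OBJECT_SIZE = 2 * TB     # maximum size of one object, from the text


def max_file_size(stripe_count, max_object_size=MAX_OBJECT_SIZE):
    """Largest file that can be stored with the given number of stripes."""
    if not 1 <= stripe_count <= MAX_STRIPE_COUNT:
        raise ValueError("stripe_count must be between 1 and 160")
    return stripe_count * max_object_size


if __name__ == "__main__":
    print(max_file_size(1) // TB, "TB with the default stripe_count of 1")
    print(max_file_size(MAX_STRIPE_COUNT) // TB, "TB with 160 stripes")
```

Note that a file using the default stripe_count of 1 is limited to the size of a single object.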
Although a single file can only be striped over 160 objects, Lustre file systems can
have thousands of OSTs. The I/O bandwidth to access a single file is the aggregated
I/O bandwidth to the objects in a file, which can be as much as the bandwidth of up to
160 servers. On systems with more than 160 OSTs, clients can do I/O using multiple
files to utilize the full file system bandwidth.
For more information about striping, see Chapter 18: Managing File Striping and Free
Space.
This chapter introduces Lustre Networking (LNET) and includes the following
sections:
■ Introducing LNET
■ Key Features of LNET
■ Supported Network Types
2.1 Introducing LNET
In a cluster with a Lustre file system, the system network connecting the servers and
the clients is implemented using Lustre Networking (LNET), which provides the
communication infrastructure required by the Lustre file system.
An LND is a pluggable driver that provides support for a particular network type.
LNDs are loaded into the driver stack, with one LND for each network type in use.
For information about administering LNET, see Part III: Administering Lustre.
Lustre can use bonded networks, such as bonded Ethernet networks, when the
underlying network technology supports bonding. For more information, see
Chapter 7: Understanding Lustre Networking (LNET).
3.1 What is Failover?
A computer system is "highly available" when the services it provides are available
with minimal downtime. In a highly-available system, if a failure condition occurs,
such as the loss of a server or a network or software fault, the system’s services
continue without interruption. Generally, we measure availability by the percentage
of time the system is required to be available.
A failover hardware setup requires a pair of servers with a shared resource (typically
a physical storage device, which may be based on SAN, NAS, hardware RAID, SCSI
or FC technology). The method of sharing storage should be essentially transparent
at the device level; the same physical logical unit number (LUN) should be visible
from both servers. To ensure high availability at the physical storage level, we
encourage the use of RAID arrays to protect against drive-level failures.
Note – Lustre does not provide redundancy for data; it depends exclusively on
redundancy of backing storage devices. The backing OST storage should be RAID 5
or, preferably, RAID 6 storage. MDT storage should be RAID 1 or RAID 0+1.
Typically, Lustre MDSs are configured as an active/passive pair, while OSSs are
deployed in an active/active configuration that provides redundancy without extra
overhead. Often the standby MDS is the active MDS for another Lustre file system or
the MGS, so no nodes are idle in the cluster.
Lustre failover requires two nodes configured as a failover pair, which must share
one or more storage devices. Lustre can be configured to provide MDT or OST
failover.
■ For MDT failover, two MDSs are configured to serve the same MDT. Only one
MDS node can serve an MDT at a time.
■ For OST failover, multiple OSS nodes are configured to be able to serve the same
OST. However, only one OSS node can serve the OST at a time. An OST can be
moved between OSS nodes that have access to the same storage device using
umount/mount commands.
Lustre failover capability can be used to upgrade the Lustre software between
successive minor versions without cluster downtime. For more information, see
Chapter 16: Upgrading Lustre.
For information about configuring failover, see Chapter 11: Configuring Lustre
Failover.
Note – Failover functionality in Lustre is provided only at the file system level. In a
complete failover solution, failover functionality for system-level components, such
as node failure detection or power control, must be provided by a third-party tool.
Caution – OST failover functionality does not protect against corruption caused by
a disk failure. If the storage media (i.e., physical disk) used for an OST fails, Lustre
cannot recover it. We strongly recommend that some form of RAID be used for OSTs.
Lustre functionality assumes that the storage is reliable, so it adds no extra reliability
features.
Note – In an environment with multiple file systems, the MDSs can be configured in
a quasi active/active configuration, with each MDS managing metadata for a subset
of the Lustre file system.
In an active/active configuration, 50% of the available OSTs are assigned to one OSS and the
remaining OSTs are assigned to the other OSS. Each OSS serves as the primary node
for half the OSTs and as a failover node for the remaining OSTs.
In this mode, if one OSS fails, the other OSS takes over all of the failed OSTs. The
clients attempt to connect to each OSS serving the OST, until one of them responds.
Data on the OST is written synchronously, and the clients replay transactions that
were in progress and uncommitted to disk before the OST failure.
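The active/active arrangement can be modeled as a simple assignment table. The Python sketch below is illustrative only: the OSS and OST names and both function names are hypothetical, and real failover also needs the third-party failure detection and power control mentioned earlier.

```python
def assign_osts(osts, oss_pair):
    """Give each OSS half of the OSTs as primary and the other half as failover."""
    first, second = oss_pair
    plan = {}
    for index, ost in enumerate(osts):
        # Alternate primaries so that each OSS is primary for half of the OSTs.
        primary, failover = (first, second) if index % 2 == 0 else (second, first)
        plan[ost] = {"primary": primary, "failover": failover}
    return plan


def serving_node(plan, ost, failed=()):
    """Return the node a client would reach for an OST, or None if both nodes are down.

    Models the client trying each configured OSS in turn until one responds.
    """
    entry = plan[ost]
    for node in (entry["primary"], entry["failover"]):
        if node not in failed:
            return node
    return None


if __name__ == "__main__":
    plan = assign_osts(["OST0000", "OST0001", "OST0002", "OST0003"], ("oss1", "oss2"))
    print(serving_node(plan, "OST0001"))                   # normal operation
    print(serving_node(plan, "OST0001", failed={"oss2"}))  # after oss2 fails
```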
Part II describes how to install and configure a Lustre file system. You will find
information in this section about:
Installation Overview
Configuring Lustre
For more information about required and optional steps to installing and configuring
Lustre, proceed to Chapter 4: Installation Overview.
CHAPTER 4
Installation Overview
This chapter provides an overview of the procedures required to set up, install and
configure a Lustre file system.
Note – If you are new to Lustre, you may find it helpful to refer to
Part I: Introducing Lustre for a description of the Lustre architecture, file system
components and terminology before proceeding with the installation procedure.
4.1 Steps to Installing Lustre
To set up Lustre file system hardware and install and configure the Lustre software,
refer to the chapters below in the order listed:
This chapter describes hardware configuration requirements for a Lustre file system
including:
■ Hardware Considerations
■ Determining Space Requirements
■ Setting File System Formatting Options
■ Determining Memory Requirements
■ Implementing Networks To Be Used by Lustre
5.1 Hardware Considerations
Lustre can work with any kind of block storage device such as single disks, software
RAID, hardware RAID, or a logical volume manager. In contrast to some networked
file systems, the block devices are only attached to the MDS and OSS nodes in Lustre
and are not accessed by the clients directly.
Since the block devices are accessed by only one or two server nodes, a storage area
network (SAN) that is accessible from all the servers is not required. Expensive
switches are not needed because point-to-point connections between the servers and
the storage arrays normally provide the simplest and best attachments. (If failover
capability is desired, the storage must be attached to multiple servers.)
For a production environment, it is preferable that the MGS have separate storage to
allow future expansion to multiple file systems. However, it is possible to run the
MDS and MGS on the same machine and have them share the same storage device.
Performance and other issues can occur when an MDS or OSS and a client are
running on the same machine:
■ Running the MDS and a client on the same machine can cause recovery and
deadlock issues and impact the performance of other Lustre clients.
■ Running the OSS and a client on the same machine can cause issues with low
memory and memory pressure. If the client consumes all the memory and then
tries to write data to the file system, the OSS will need to allocate pages to receive
data from the client but will not be able to perform this operation due to low
memory. This can cause the client to hang.
Only servers running on 64-bit CPUs are tested and supported. 64-bit CPU clients are
typically used for testing to match expected customer usage and avoid limitations
due to the 4 GB limit for RAM size, 1 GB low-memory limitation, and 16 TB file size
limit of 32-bit CPUs. Also, due to kernel API limitations, performing backups of
Lustre 2.x file systems on 32-bit clients may cause backup tools to confuse files that
have the same 32-bit inode number.
Lustre uses journaling file system technology on both the MDTs and OSTs. For an
MDT, as much as a 20 percent performance gain can be obtained by placing the
journal on a separate device.
The MDS can effectively utilize a lot of CPU cycles. A minimum of four processor
cores is recommended. More are advisable for file systems with many clients.
If you need a larger MDT, create multiple RAID1 devices from pairs of disks, and
then make a RAID0 array of the RAID1 devices. This ensures maximum reliability
because multiple disk failures only have a small chance of hitting both disks in the
same RAID1 device.
Doing the opposite (RAID1 of a pair of RAID0 devices) has a 50% chance that even
two disk failures can cause the loss of the whole MDT device. The first failure
disables an entire half of the mirror and the second failure has a 50% chance of
disabling the remaining mirror.
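The difference between the two layouts can be checked by enumerating every possible pair of failed disks. The Python sketch below is a simplified model under stated assumptions: disks fail independently, exactly two disks have failed, and the function names are hypothetical. Under this model the exact loss fraction for a mirror of two RAID0 sets is (n/2)/(n-1) for n disks, which is close to the 50% figure quoted above for larger arrays, while the fraction for RAID0 over RAID1 pairs is only 1/(n-1).

```python
from fractions import Fraction
from itertools import combinations


def raid10_lost(failed):
    """RAID0 of RAID1 pairs: disks (0,1), (2,3), ... mirror each other.

    Data is lost only when both disks of the same mirror pair have failed.
    """
    return any((disk ^ 1) in failed for disk in failed)


def raid01_lost(failed, n_disks):
    """RAID1 of two RAID0 halves: data is lost when each half has a failed disk."""
    half = n_disks // 2
    return any(d < half for d in failed) and any(d >= half for d in failed)


def loss_after_two_failures(n_disks, layout):
    """Fraction of two-disk failure combinations that destroy the whole device."""
    if n_disks < 4 or n_disks % 2:
        raise ValueError("n_disks must be an even number >= 4")
    pairs = list(combinations(range(n_disks), 2))
    if layout == "1+0":
        lost = sum(raid10_lost(set(p)) for p in pairs)
    elif layout == "0+1":
        lost = sum(raid01_lost(set(p), n_disks) for p in pairs)
    else:
        raise ValueError("layout must be '1+0' or '0+1'")
    return Fraction(lost, len(pairs))


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, loss_after_two_failures(n, "1+0"), loss_after_two_failures(n, "0+1"))
```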
Lustre file system capacity is the sum of the capacities provided by the targets. For
example, 64 OSSs, each with two 8 TB targets, provide a file system with a capacity
of nearly 1 PB. If each OST uses ten 1 TB SATA disks (8 data disks plus 2 parity disks
in a RAID 6 configuration), it may be possible to get 50 MB/sec from each drive,
providing up to 400 MB/sec of disk bandwidth per OST. If this system is used as
storage backend with a system network like InfiniBand that provides a similar
bandwidth, then each OSS could provide 800 MB/sec of end-to-end I/O throughput.
(Although the architectural constraints described here are simple, in practice it takes
careful hardware selection, benchmarking and integration to obtain such results.)
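The capacity and bandwidth figures in this example follow from straightforward multiplication. A minimal Python sketch of the same arithmetic is shown below; the function names are illustrative, and capping OSS throughput at the network bandwidth is an assumption added here to model the "similar bandwidth" condition, not something the text states as a formula.

```python
def file_system_capacity_tb(num_oss, osts_per_oss, ost_size_tb):
    """Total capacity is the sum of the capacities of all targets."""
    return num_oss * osts_per_oss * ost_size_tb


def ost_disk_bandwidth(data_disks, mb_per_sec_per_disk):
    """Only the data disks of a RAID 6 set contribute streaming bandwidth."""
    return data_disks * mb_per_sec_per_disk


def oss_throughput(osts_per_oss, ost_mb_per_sec, network_mb_per_sec=None):
    """End-to-end throughput of one OSS, limited by the slower of disk and network."""
    disk = osts_per_oss * ost_mb_per_sec
    if network_mb_per_sec is None:
        return disk
    return min(disk, network_mb_per_sec)


if __name__ == "__main__":
    print(file_system_capacity_tb(64, 2, 8), "TB total")   # 64 OSSs, two 8 TB targets each
    per_ost = ost_disk_bandwidth(8, 50)                    # 8 data disks at 50 MB/sec
    print(per_ost, "MB/sec per OST")
    print(oss_throughput(2, per_ost), "MB/sec per OSS")
```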
Each time a file is created on a Lustre file system, it consumes one inode on the MDT
and one inode for each OST object over which the file is striped. Normally, each file’s
stripe count is based on the system-wide default stripe count. However, this can be
changed for individual files using the lfs setstripe command. For more details, see
Section 18.3, “Setting the File Layout/Striping Configuration (lfs setstripe)” on
page 18-4.
In a Lustre ldiskfs file system, all the inodes are allocated on the MDT and OSTs
when the file system is first formatted. The total number of inodes on a formatted
MDT or OST cannot be easily changed, although it is possible to add OSTs with
additional space and corresponding inodes. Thus, the number of inodes created at
format time should be generous enough to anticipate future expansion.
When the file system is in use and a file is created, the metadata associated with that
file is stored in one of the pre-allocated inodes and does not consume any of the free
space used to store file data.
Note – By default, the ldiskfs file system used by Lustre servers to store user-data
objects and system data reserves 5% of space that cannot be used by Lustre.
Additionally, Lustre reserves up to 400 MB on each OST for journal use and a small
amount of space outside the journal to store accounting data for Lustre. This reserved
space is unusable for general storage. Thus, at least 400 MB of space is used on each
OST before any file object data is saved.
For example, if the average file size is 5 MB and you have 100 TB of usable OST
space, then you can calculate the minimum number of inodes as follows:
(100 TB * 1024 GB/TB * 1024 MB/GB) / 5 MB/inode = 20 million inodes
We recommend that you use at least twice the minimum number of inodes to allow
for future expansion and allow for an average file size smaller than expected. Thus,
the required space is:
4 KB/inode * 40 million inodes = 160 GB
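The sizing calculation above can be reproduced as follows. This Python sketch uses binary units (1 TB = 1024 GB), as the formula in the text does; the function names are illustrative. The exact inode count is slightly under 21 million, which the text rounds to 20 million.

```python
KB, MB, GB, TB = 1 << 10, 1 << 20, 1 << 30, 1 << 40


def minimum_inodes(usable_ost_bytes, average_file_bytes):
    """One MDT inode is needed for every file that fits in the usable OST space."""
    return usable_ost_bytes // average_file_bytes


def mdt_space_bytes(min_inodes, bytes_per_inode=4 * KB, safety_factor=2):
    """MDT space for twice the minimum inode count, at 4 KB of space per inode."""
    return min_inodes * safety_factor * bytes_per_inode


if __name__ == "__main__":
    inodes = minimum_inodes(100 * TB, 5 * MB)
    print(inodes, "minimum inodes")
    print(mdt_space_bytes(inodes) // GB, "GB of MDT space")
```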
If the average file size is small, 4 KB for example, Lustre is not very efficient as the
MDT uses as much space as the OSTs. However, this is not a common configuration
for Lustre.
Note – If the MDT is too small, this can cause all the space on the OSTs to be
unusable. Be sure to determine the appropriate size of the MDT needed to support
the file system before formatting the file system. It is difficult to increase the number
of inodes after the file system is formatted.
--mkfsoptions='backing fs options'
For other options to format backing ldiskfs filesystems, see the Linux man page for
mke2fs(8).
For example, use the following option to create one inode per 2048 bytes of file
system space.
--mkfsoptions="-i 2048"
To avoid mke2fs creating an unusable file system, do not specify the -i option with
an inode ratio below one inode per 1024 bytes. Instead, specify an absolute number
of inodes, using this option:
-N <number of inodes>
For example, by default, a 2 TB MDT will have 512M inodes. The largest
currently-supported file system size is 16 TB, which would hold 4B inodes, the
maximum possible number of inodes in a ldiskfs file system. With an MDS inode
ratio of 1024 bytes per inode, a 2 TB MDT would hold 2B inodes, and a 4 TB MDT
would hold 4B inodes.
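The inode counts quoted above follow from dividing the device size by the bytes-per-inode ratio. A short Python sketch of that relationship is shown below; the function names are illustrative, and the ceiling of 2^32 inodes is the "4B" ldiskfs maximum mentioned in the text.

```python
TB = 1 << 40
MAX_LDISKFS_INODES = 1 << 32   # the "4B" ceiling described in the text


def inode_count(device_bytes, bytes_per_inode):
    """Number of inodes created when formatting with a given inode ratio (-i)."""
    return device_bytes // bytes_per_inode


def within_inode_limit(device_bytes, bytes_per_inode):
    """True if the resulting inode count does not exceed the ldiskfs maximum."""
    return inode_count(device_bytes, bytes_per_inode) <= MAX_LDISKFS_INODES


if __name__ == "__main__":
    for size_tb, ratio in ((2, 4096), (16, 4096), (2, 1024), (4, 1024), (8, 1024)):
        count = inode_count(size_tb * TB, ratio)
        print(size_tb, "TB at", ratio, "bytes/inode:", count, within_inode_limit(size_tb * TB, ratio))
```

The last case (8 TB at 1024 bytes per inode) exceeds the limit, which illustrates why a small inode ratio cannot be combined with a very large MDT.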
To specify a larger inode size, use the -I <inodesize> option. We recommend you
do NOT specify a smaller-than-default inode size, as this can lead to serious
performance problems; and you cannot change this parameter after formatting the
file system. The inode ratio must always be larger than the inode size.
num_ost_inodes =
4 * <num_mds_inodes> * <default_stripe_count> / <number_osts>
You can specify the number of inodes on the OST file systems using the following
option to the --mkfsoptions parameter:
-N <num_inodes>
Alternately, if you know the average file size, then you can specify the OST inode
count for the OST file systems using:
For example, if the average file size is 16 MB and there are, by default 4 stripes per
file, then --mkfsoptions='-i 1048576' would be appropriate.
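Both OST sizing rules can be expressed in a few lines of Python. The first function is the num_ost_inodes formula given above. The second is an inference, not a rule stated in the text: the -i value in the example is consistent with dividing the average file size by four times the stripe count, using the same factor of 4 as the formula. The function names and the sample MDS inode and OST counts are hypothetical.

```python
MB = 1 << 20


def num_ost_inodes(num_mds_inodes, default_stripe_count, number_osts):
    """Inodes to create on each OST, per the formula in the text."""
    return 4 * num_mds_inodes * default_stripe_count // number_osts


def ost_inode_ratio(average_file_bytes, stripes_per_file):
    """Bytes-per-inode value for -i (assumed: average file size / (stripes * 4))."""
    return average_file_bytes // (stripes_per_file * 4)


if __name__ == "__main__":
    # Hypothetical system: 40 million MDS inodes, stripe count 1, 128 OSTs.
    print("-N", num_ost_inodes(40_000_000, 1, 128))
    # The example from the text: 16 MB average file size, 4 stripes per file.
    print("-i", ost_inode_ratio(16 * MB, 4))
```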
For more details on formatting MDT and OST file systems, see Section 6.4,
“Formatting Options for RAID Devices” on page 6-3.
Limit                           Value                     Description
Maximum stripe count            160                       This limit is hard-coded, but is near the upper limit
                                                          imposed by the underlying ldiskfs file system.
Maximum stripe size             < 4 GB                    The amount of data written to each object before moving
                                                          on to next object.
Minimum stripe size             64 KB                     Due to the 64 KB PAGE_SIZE on some 64-bit machines, the
                                                          minimum stripe size is set to 64 KB.
Maximum object size             2 TB                      The amount of data that can be stored in a single object.
                                                          The ldiskfs limit of 2TB for a single file applies. Lustre
                                                          allows 160 stripes of 2 TB each.
Maximum number of OSTs          8150                      The maximum number of OSTs is a constant that can be
                                                          changed at compile time. Lustre has been tested with up
                                                          to 4000 OSTs.
Maximum number of MDTs          1                         Maximum of 1 MDT per file system, but a single MDS can
                                                          host multiple MDTs, each one for a separate file system.
Maximum number of clients       131072                    The number of clients is a constant that can be changed
                                                          at compile time.
Maximum size of a file system   64 PB                     Each OST or MDT can have a file system up to 16 TB,
                                                          regardless of whether 32-bit or 64-bit kernels are on the
                                                          server. You can have multiple OST file systems on a
                                                          single OSS node.
Maximum file size               16 TB on 32-bit systems   Individual files have a hard limit of nearly 16 TB on
                                320 TB on 64-bit systems  32-bit systems imposed by the kernel memory subsystem. On
                                                          64-bit systems this limit does not exist. Hence, files
                                                          can be 64-bits in size. Lustre imposes an additional size
                                                          limit of up to the number of stripes, where each stripe
                                                          is 2 TB. A single file can have a maximum of 160 stripes,
                                                          which gives an upper single file limit of 320 TB for
                                                          64-bit systems. The actual amount of data that can be
                                                          stored in a file depends upon the amount of free space in
                                                          each OST on which the file is striped.
Maximum number of files or      10 million files          Lustre uses the ldiskfs hashed directory code, which has
subdirectories in a single                                a limit of about 10 million files depending on the length
directory                                                 of the file name. The limit on subdirectories is the same
                                                          as the limit on regular files. Lustre is tested with ten
                                                          million files in a single directory.
Maximum number of files in the  4 billion                 The ldiskfs file system imposes an upper limit of 4
file system                                               billion inodes. By default, the MDS file system is
                                                          formatted with 4 KB of space per inode, meaning 512
                                                          million inodes per file system of 2 TB. This can be
                                                          increased initially, at the time of MDS file system
                                                          creation. For more information, see Section 5.3,
                                                          “Setting File System Formatting Options” on page 5-7.
Maximum length of a filename    255 bytes (filename)      This limit is 255 bytes for a single filename, the same
                                                          as in an ldiskfs file system.
Maximum length of a pathname    4096 bytes (pathname)     The Linux VFS imposes a full pathname length of 4096
                                                          bytes.
Maximum number of open files    None                      Lustre does not impose a maximum for the number of open
for Lustre file systems                                   files, but the practical limit depends on the amount of
                                                          RAM on the MDS. No "tables" for open files exist on the
                                                          MDS, as they are only linked in a list to a given
                                                          client's export. Each client process probably has a limit
                                                          of several thousands of open files which depends on the
                                                          ulimit.
The amount of memory used by the MDS is a function of how many clients are on
the system, and how many files they are using in their working set. This is driven,
primarily, by the number of locks a client can hold at one time. The number of locks
held by clients varies by load and memory availability on the server. Interactive
For example, for a single MDT on an MDS with 1,000 clients, 16 interactive nodes,
and a 2 million file working set (of which 400,000 files are cached on the clients):
Thus, the minimum requirement for a system with this configuration is at least 4 GB
of RAM. However, additional memory may significantly improve performance.
The same calculation applies to files accessed from the OSS as for the MDS, but the
load is distributed over many more OSS nodes, so the amount of memory required
for locks, inode cache, etc. listed under MDS is spread out over the OSS nodes, as
shown in 5.2.3.1.
Per OSS DLM locks + filesystem metadata = 3520MB/6 OSS = 600MB (approx.)
This consumes about 1,400 MB just for the pre-allocated buffers, and an additional 2
GB for minimal file system and kernel usage. Therefore, for a non-failover
configuration, the minimum RAM would be 4 GB for an OSS node with two OSTs.
Adding additional memory on the OSS will improve the performance of reading
smaller, frequently-accessed files.
For a failover configuration, the minimum RAM would be at least 6 GB. For 4 OSTs
on each OSS in a failover configuration, 10 GB of RAM is reasonable. When the OSS is
not handling any failed-over OSTs, the extra RAM will be used as a read cache.
As a reasonable rule of thumb, about 2 GB of base memory plus 1 GB per OST can be
used. In failover configurations, about 2 GB per OST is needed.
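The rule of thumb above can be written as a one-line function. The Python sketch below is illustrative (the function name is hypothetical), and it reproduces the 4 GB, 6 GB and 10 GB figures given earlier in this section.

```python
def oss_min_ram_gb(osts_per_oss, failover=False):
    """Rule-of-thumb OSS memory: 2 GB base plus 1 GB per OST, or 2 GB per OST with failover.

    osts_per_oss is the number of OSTs the OSS serves in normal operation.
    """
    if osts_per_oss < 1:
        raise ValueError("an OSS must serve at least one OST")
    per_ost_gb = 2 if failover else 1
    return 2 + per_ost_gb * osts_per_oss


if __name__ == "__main__":
    print(oss_min_ram_gb(2), "GB for two OSTs without failover")
    print(oss_min_ram_gb(2, failover=True), "GB for two OSTs with failover")
    print(oss_min_ram_gb(4, failover=True), "GB for four OSTs with failover")
```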
Lustre networks and routing are configured and managed by specifying parameters
to the Lustre Networking (lnet) module in /etc/modprobe.conf or
/etc/modprobe.conf.local (depending on your Linux distribution).
Note – We recommend that you use “dotted-quad” notation for IP addresses rather
than host names to make it easier to read debug logs and debug configurations with
multiple interfaces.
This chapter describes best practices for storage selection and file system options to
optimize performance on RAID, and includes the following sections:
■ Selecting Storage for the MDT and OSTs
■ Reliability Best Practices
■ Performance Tradeoffs
■ Formatting Options for RAID Devices
■ Connecting a SAN to a Lustre File System
Note – It is strongly recommended that hardware RAID be used with Lustre. Lustre
currently does not support any redundancy at the file system level and RAID is
required to protect against disk failure.
6.1 Selecting Storage for the MDT and OSTs
The Lustre architecture allows the use of any kind of block device as backend
storage. The characteristics of such devices, particularly in the case of failures, vary
significantly and have an impact on configuration choices.
For better performance, we recommend that you create RAID sets with 4 or 8 data
disks plus one or two parity disks. Using larger RAID sets will negatively impact
performance compared to having multiple independent RAID sets.
To maximize performance for small I/O request sizes, storage configured as RAID
1+0 can yield much better results but will increase cost or reduce capacity.
Backups of the metadata file systems are recommended. For details, see Chapter 17:
Backing Up and Restoring a File System.
If writeback cache is enabled, a file system check is required after the array loses
power. Data may also be lost because of this.
Therefore, we recommend against the use of writeback cache when data integrity is
critical. You should carefully consider whether the benefits of using writeback cache
outweigh the risks.
For RAID 5, RAID 6, or RAID 1+0 storage, specifying the following option to the
--mkfsoptions parameter improves the layout of the file system metadata,
ensuring that no single disk contains all of the allocation bitmaps:
-E stride=<chunk_blocks>
For more information on how to override the defaults while formatting MDT or OST
file systems, see Section 5.3, “Setting File System Formatting Options” on page 5-7.
If the RAID configuration does not allow <chunk_blocks> to fit evenly into 1 MB,
select <chunk_blocks> such that <stripe_width_blocks> is close to 1 MB, but
not larger.
Run --reformat on the file system device (/dev/sdc), specifying the RAID
geometry to the underlying ldiskfs file system, where:
--mkfsoptions "<other options> -E stride=<chunk_blocks>, \
stripe_width=<stripe_width_blocks>"...
Example:
A RAID 6 configuration with 6 disks has 4 data and 2 parity disks. The
<chunk_blocks> <= 1024 KB/4 = 256 KB.
Because the number of data disks is a power of 2, the stripe width is equal
to 1 MB.
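The stride and stripe width arithmetic can be sketched in Python. This is a minimal illustration, not part of the Lustre tools: the function names are hypothetical, the 4096-byte block size is an assumption about the ldiskfs file system, and the chunk size is assumed to be a power of two in KB, as is common for RAID controllers.

```python
BLOCK_SIZE = 4096  # bytes; assumed ldiskfs block size


def largest_chunk_kb(data_disks, target_kb=1024):
    """Largest power-of-two chunk size (KB) whose full stripe fits in target_kb."""
    if data_disks < 1 or data_disks > target_kb:
        raise ValueError("data_disks must be between 1 and target_kb")
    chunk_kb = 1
    while chunk_kb * 2 * data_disks <= target_kb:
        chunk_kb *= 2
    return chunk_kb


def raid_geometry(data_disks, chunk_kb, block_size=BLOCK_SIZE):
    """Return (stride, stripe_width), both in file system blocks."""
    stride = chunk_kb * 1024 // block_size
    return stride, stride * data_disks


if __name__ == "__main__":
    # RAID 6 with 6 disks: 4 data disks and 2 parity disks.
    data_disks = 4
    chunk_kb = largest_chunk_kb(data_disks)
    stride, stripe_width = raid_geometry(data_disks, chunk_kb)
    print(f"chunk size: {chunk_kb} KB")
    print(f"-E stride={stride},stripe_width={stripe_width}")
```

With 4 data disks the full stripe is exactly 1 MB. With a data disk count that is not a power of two (for example 6), the sketch picks the largest chunk that keeps the full stripe at or below 1 MB.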
Lustre's default journal size is 400 MB. A journal size of up to 1 GB has shown
increased performance but diminishing returns are seen for larger journals.
Additionally, a copy of the journal is kept in RAM. Therefore, make sure you have
enough memory available to hold copies of all the journals.
The file system journal options are specified to mkfs.lustre using the
--mkfsoptions parameter. For example:
To create an external journal, perform these steps for each OST on the OSS:
This chapter describes how to use multiple network interfaces in parallel to increase
bandwidth and/or redundancy. Topics include:
■ Network Interface Bonding Overview
■ Requirements
■ Bonding Module Parameters
■ Setting Up Bonding
■ Configuring Lustre with Bonding
■ Bonding References
7.1 Network Interface Bonding Overview
Bonding, also known as link aggregation, trunking and port trunking, is a method of
aggregating multiple physical network links into a single logical link for increased
bandwidth.
Several different types of bonding are available in Linux. All these types are referred
to as “modes,” and use the bonding kernel module.
Modes 0 to 3 allow load balancing and fault tolerance by using multiple interfaces.
Mode 4 aggregates a group of interfaces into a single virtual interface where all
members of the group share the same speed and duplex settings. This mode is
described under IEEE spec 802.3ad, and it is referred to as either “mode 4” or
“802.3ad.”
7.2 Requirements
The most basic requirement for successful bonding is that both endpoints of the
connection must be capable of bonding. In a normal case, the non-server endpoint is
a switch. (Two systems connected via crossover cables can also use bonding.) Any
switch used must explicitly handle 802.3ad Dynamic Link Aggregation.
The kernel must also be configured with bonding. All supported Lustre kernels have
bonding functionality. The network driver for the interfaces to be bonded must have
the ethtool functionality to determine slave speed and duplex settings. All recent
network drivers implement it.
# which ethtool
/sbin/ethtool
# ethtool eth0
Settings for eth0:
Supported ports: [ TP MII ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
Advertised auto-negotiation: Yes
# ethtool eth1
Outgoing traffic is mapped across the slave interfaces according to the transmit hash
policy. For Lustre, we recommend that you set the xmit_hash_policy option to the
layer3+4 option for bonding. This policy uses upper layer protocol information if
available to generate the hash. This allows traffic to a particular network peer to span
multiple slaves, although a single connection does not span multiple slaves.
$ xmit_hash_policy=layer3+4
The miimon option enables link status monitoring. (The parameter is a time
interval in milliseconds.) It makes an interface failure transparent to avoid serious
network degradation during link failures. A reasonable default setting is 100
milliseconds:
$ miimon=100
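The two bonding options described above are normally combined on a single options line for the bonding module. The Python sketch below only assembles such a line; the function name is hypothetical, and mode=802.3ad is an assumed choice made here because the transmit hash policy applies to that mode. Adjust the mode to match your switch configuration.

```python
def bonding_options(bond="bond0", mode="802.3ad", miimon=100,
                    xmit_hash_policy="layer3+4"):
    """Build a modprobe.conf 'options' line for one bonded interface."""
    if miimon <= 0:
        raise ValueError("miimon must be a positive interval in milliseconds")
    return (f"options {bond} mode={mode} miimon={miimon} "
            f"xmit_hash_policy={xmit_hash_policy}")


if __name__ == "__main__":
    print(f"alias bond0 bonding")
    print(bonding_options())
```

The printed lines are meant to be reviewed and then placed in /etc/modprobe.conf by hand; the sketch does not modify any system file.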
# vi /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.10.79 # Use the free IP Address of your network
NETWORK=192.168.10.0
NETMASK=255.255.255.0
USERCTL=no
BOOTPROTO=none
ONBOOT=yes
# vi /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
# vi /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
4. Set up the bond interface and its options in /etc/modprobe.conf. Start the slave
interfaces by your normal network method.
# vi /etc/modprobe.conf
# modprobe bonding
# ifconfig bond0 up
# ifenslave bond0 eth0 eth1
Note – You must modprobe the bonding module for each bonded interface. If you
wish to create bond0 and bond1, two entries in modprobe.conf are required.
The examples below are from Red Hat systems. For setup, use:
/etc/sysconfig/network-scripts/ifcfg-*. The website referenced below
includes detailed instructions for other configuration methods, instructions to use
DHCP with bonding, and other setup details. We strongly recommend you use this
website.
http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.0.3 (March 23, 2006)
# ifconfig
bond0 Link encap:Ethernet HWaddr 4C:00:10:AC:61:E0
inet addr:192.168.10.79 Bcast:192.168.10.255 \
Mask:255.255.255.0
inet6 addr: fe80::4e00:10ff:feac:61e0/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:3091 errors:0 dropped:0 overruns:0 frame:0
TX packets:880 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:314203 (306.8 KiB) TX bytes:129834 (126.7 KiB)
# cat /etc/modprobe.conf
alias eth0 8139too
alias scsi_hostadapter sata_via
alias scsi_hostadapter1 usb-storage
alias snd-card-0 snd-via82xx
options snd-card-0 index=0
options snd-via82xx index=0
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
options lnet networks=tcp
alias eth1 via-rhine
# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
NETMASK=255.255.255.0
IPADDR=192.168.10.79 # (Assign here the IP of the bonded interface.)
ONBOOT=yes
USERCTL=no
ifcfg-ethx
# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
DEVICE=eth0
HWADDR=4c:00:10:ac:61:e0
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
IPV6INIT=no
PEERDNS=yes
MASTER=bond0
SLAVE=yes
In the following example, the bond0 interface is the master (MASTER) while eth0 and
eth1 are slaves (SLAVE).
Note – All slaves of bond0 share the MAC address (HWaddr) of bond0. This is true
for all modes except TLB and ALB, which require a unique MAC address for each
slave.
8.1 Preparing to Install the Lustre Software
If you are using a supported Linux distribution and architecture, you can install
Lustre from downloaded packages (RPMs). For a list of supported configurations, see
the topic Lustre_2.0 on the Lustre wiki.
If you are not using a supported configuration, you can install Lustre directly from
the source code. For more information on this installation method, see Chapter 29:
Installing Lustre from Source Code.
At least one Lustre RPM must be installed on each server and on each client in a
Lustre file system. TABLE 8-1 lists required Lustre packages and indicates where they
are to be installed. Some Lustre packages are installed on Lustre servers (MGS, MDS,
and OSSs), some are installed on Lustre clients, and some are installed on all Lustre
nodes.
Lustre Package         Description                                   Install on   Install on
                                                                     servers*     clients
Lustre utilities:
lustre-<ver>           Lustre utilities package. This includes       X*
                       userspace utilities to configure and run
                       Lustre.
lustre-client-<ver>    Lustre utilities for clients.                              X
lustre-ldiskfs-<ver>   Lustre-patched backing file system kernel     X
                       module package for the ldiskfs file system.
e2fsprogs-<ver>        Utilities package used to maintain the        X
                       ldiskfs backing file system.
* Installing a patched kernel on a client node is not required. However, if a client node will be used as both a
client and a server, or if you want to install the same kernel on all nodes for any reason, install the server
packages designated with an asterisk (*) on the client node.
In all supported Lustre installations, a patched kernel must be run on each server,
including the MGS, the MDS, and all OSSs. Running a patched kernel on a Lustre
client is only required if the client will be used for multiple purposes, such as
running as both a client and an OST, or if you want to use the same kernel on all
nodes.
Lustre RPM packages are available on the Lustre download site. They must be
installed in the order described in Section 8.2, “Lustre Installation Procedure” on
page 8-6.
■ Perl - Various userspace utilities are written in Perl. Any recent version of Perl
will work with Lustre.
For more information about debugging tools, see the topic Debugging Lustre on the
Lustre wiki.
Note – It is not recommended that you use the rpm -Uvh command to install a
kernel, because this may leave you with an unbootable system if the new kernel
doesn’t work for some reason.
i. Verify that the bootloader configuration file has been updated with an
entry for the patched kernel.
Before you can boot to a new distribution or kernel, there must be an entry
for it in the bootloader configuration file. Often this is added automatically
when the kernel RPM is installed.
selinux=0
Note – The rpm command options --force or --nodeps should not be used to
install or update the Lustre-specific e2fsprogs package. If errors are reported, file a
bug (for instructions, see the topic Reporting Bugs on the Lustre wiki).
f. (Optional) To add optional packages to your Lustre file system, install them
now.
Optional packages include file system creation and repair tools, debugging
tools, test programs and scripts, Linux kernel and Lustre source code, and other
packages. A complete list of optional packages for your platform is provided on
the Lustre download site.
i. Verify that the bootloader configuration file has been updated with an
entry for the patched kernel.
Before you can boot to a new distribution or kernel, there must be an entry
for it in the bootloader configuration file. Often this is added automatically
when the kernel RPM is installed.
selinux=0
This chapter describes how to configure Lustre Networking (LNET). It includes the
following sections:
■ Overview of LNET Module Parameters
■ Setting the LNET Module networks Parameter
■ Setting the LNET Module ip2nets Parameter
■ Setting the LNET Module routes Parameter
■ Testing the LNET Configuration
■ Configuring the Router Checker
■ Best Practices for LNET Options
LNET will, by default, use the first TCP/IP interface it discovers on a system (eth0).
If this network configuration is sufficient, you do not need to configure LNET. LNET
configuration is required if you are using Infiniband or multiple Ethernet interfaces.
9.1 Overview of LNET Module Parameters
LNET kernel module (lnet) parameters specify how LNET is to be configured to
work with Lustre, including which NICs will be configured to work with Lustre and
the routing to be used with Lustre.
To specify the network interfaces that are to be used for Lustre, set either the
networks parameter or the ip2nets parameter (only one of these parameters can
be used at a time):
■ networks - Specifies the networks to be used.
■ ip2nets - Lists globally-available networks, each with a range of IP addresses.
LNET then identifies locally-available networks through address list-matching
lookup.
See Section 9.2, “Setting the LNET Module networks Parameter” on page 9-4 and
Section 9.3, “Setting the LNET Module ip2nets Parameter” on page 9-6 for more
details.
See Section 9.4, “Setting the LNET Module routes Parameter” on page 9-7 for more
details.
A router checker can be configured to enable Lustre nodes to detect router health
status, avoid routers that appear dead, and reuse those that restore service after
failures. See Section 9.6, “Configuring the Router Checker” on page 9-8 for more
details.
For a complete reference to the LNET module parameters, see Section 35.2.1, “LNET
Options” on page 35-3.
Note – We recommend that you use “dotted-quad” notation for IP addresses rather
than host names to make it easier to read debug logs and debug configurations with
multiple interfaces.
Examples are:
10.67.73.200@tcp0
10.67.75.100@o2ib
The first entry above identifies a TCP/IP node, while the second entry identifies an
InfiniBand node.
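A NID such as those shown above has the form address@network. The following Python sketch splits a NID into its parts; the function name is hypothetical, and treating a network name without a trailing number (such as o2ib) as network number 0 is an assumption made for this illustration.

```python
def parse_nid(nid):
    """Split a NID into (address, network type, network number)."""
    address, sep, network = nid.partition("@")
    if not sep or not address or not network:
        raise ValueError(f"not a NID: {nid!r}")
    # The network type is the name with any trailing digits removed.
    net_type = network.rstrip("0123456789")
    if not net_type:
        raise ValueError(f"missing network type in {nid!r}")
    number = network[len(net_type):]
    # Assumption: a missing network number means network 0.
    return address, net_type, int(number) if number else 0


if __name__ == "__main__":
    for nid in ("10.67.73.200@tcp0", "10.67.75.100@o2ib"):
        print(nid, "->", parse_nid(nid))
```

A helper like this can be useful in scripts that need to choose, from several MDS NIDs, the one that matches a client's local network type.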
When a mount command is run on a client, the client uses the NID of the MDS to
retrieve configuration information. If an MDS has more than one NID, the client
should use the appropriate NID for its local network.
To determine the appropriate NID to specify in the mount command, use the lctl
command. To display MDS NIDs, run on the MDS:
lctl list_nids
To determine if a client can reach the MDS using a particular NID, run on the client:
This example specifies that a Lustre node will use a TCP/IP interface and an
InfiniBand interface:
This example specifies that the Lustre node will use the TCP/IP interface eth1:
When more than one interface is available during the network setup, Lustre chooses
the best route based on the hop count. Once the network connection is established,
Lustre expects the network to stay connected. In a Lustre network, connections do
not fail over to another interface, even if multiple interfaces are available on the same
node.
Note – LNET lines in modprobe.conf are only used by the local node to determine
what to call its interfaces. They are not used for routing decisions.
Note – By default, Lustre ignores the loopback (lo0) interface. Lustre does not
ignore IP addresses aliased to the loopback. If you alias IP addresses to the loopback
interface, you must specify all Lustre networks using the LNET networks parameter.
Note – If the server has multiple interfaces on the same subnet, the Linux kernel will
send all traffic using the first configured interface. This is a limitation of Linux, not
Lustre. In this case, network interface bonding should be used. For more information
about network interface bonding, see Chapter 7: Setting Up Network Interface
Bonding.
Note that the IP address patterns listed in the ip2nets option are only used to
identify the networks that an individual node should instantiate. They are not used
by LNET for any other communications purpose.
For the example below, the nodes in the network have these IP addresses:
■ Server svr1: eth0 IP address 192.168.0.2, IP over Infiniband (o2ib) address
132.6.1.2.
■ Server svr2: eth0 IP address 192.168.0.4, IP over Infiniband (o2ib) address
132.6.1.4.
■ TCP clients have IP addresses 192.168.0.5-255.
■ Infiniband clients have IP over Infiniband (o2ib) addresses 132.6.[2-3].2, .4, .6, .8.
The following entry is placed in the modprobe.conf file on each server and client:
The order of LNET entries is important when configuring servers. If a server node
can be reached using more than one network, the first network specified in
modprobe.conf will be used.
Because svr1 and svr2 match the first rule, LNET uses eth0 for tcp0 on those
machines. (Although svr1 and svr2 also match the second rule, the first matching
rule for a particular network is used).
The [2-8/2] format indicates a range of 2-8 stepped by 2; that is 2,4,6,8. Thus, the
clients at 132.6.3.5 will not find a matching o2ib network.
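To check which addresses a bracketed range matches, the expansion can be worked through in a few lines of Python. This is only a sketch of the range notation described above, not LNET's own matching code; the function names and the sample pattern 132.6.[1-3].[2-8/2] are assumptions chosen to fit the addresses listed in this section.

```python
def expand_field(pattern, low=0, high=255):
    """Expand one address field such as '*', '7', '[2,4]' or '[2-8/2]'."""
    if pattern == "*":
        return list(range(low, high + 1))
    if not (pattern.startswith("[") and pattern.endswith("]")):
        return [int(pattern)]
    values = set()
    for item in pattern[1:-1].split(","):
        span, _, step = item.partition("/")
        start, _, end = span.partition("-")
        first = int(start)
        last = int(end) if end else first
        values.update(range(first, last + 1, int(step) if step else 1))
    return sorted(values)


def ip_matches(ip, pattern):
    """Return True if the dotted-quad ip matches the four-field pattern."""
    octets = [int(part) for part in ip.split(".")]
    fields = pattern.split(".")
    if len(octets) != 4 or len(fields) != 4:
        return False
    return all(o in expand_field(f) for o, f in zip(octets, fields))


if __name__ == "__main__":
    print(expand_field("[2-8/2]"))
    print(ip_matches("132.6.3.4", "132.6.[1-3].[2-8/2]"))
    print(ip_matches("132.6.3.5", "132.6.[1-3].[2-8/2]"))
```

The last call returns False because 5 is not in the stepped range, which is why a client at 132.6.3.5 would not find a matching o2ib network.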
This example specifies bi-directional routing in which TCP clients can reach Lustre
resources on the IB networks and IB servers can access the TCP networks:
All LNET routers that bridge two networks are equivalent. They are not configured
as primary or secondary, and the load is balanced across all available routers.
The number of LNET routers is not limited. Enough routers should be used to handle
the required file serving bandwidth plus a 25 percent margin for headroom.
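The 25 percent headroom guideline translates into a simple sizing calculation. The Python sketch below is illustrative: the function name is hypothetical, and the per-router throughput is an input you would have to measure for your own hardware.

```python
import math


def routers_needed(required_bw, per_router_bw, margin=0.25):
    """Number of LNET routers for the required bandwidth plus headroom.

    Both bandwidth arguments must use the same unit (for example GB/s).
    """
    if required_bw < 0 or per_router_bw <= 0:
        raise ValueError("bandwidth values must be positive")
    return math.ceil(required_bw * (1 + margin) / per_router_bw)


if __name__ == "__main__":
    # Example: 10 GB/s of file serving bandwidth, 1 GB/s per router (assumed).
    print(routers_needed(10, 1))
```

Because all routers that bridge the same two networks are equivalent and share the load, rounding up to the next whole router is the natural choice.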
modprobe lnet
lctl network configure
The router checker obtains the following information from each router:
■ Time the router was disabled
■ Elapsed disable time
If the router checker does not get a reply message from the router within
router_ping_timeout seconds, it considers the router to be down.
If 100 packets have been sent successfully through a router, the sent-packets counter
for that router will have a value of 100.
Added quotes may confuse some distributions. Messages such as the following may
indicate an issue related to added quotes:
lnet: Unknown parameter ‘'networks'
Including comments
Place the semicolon terminating a comment immediately after the comment. LNET silently
ignores everything between the # character at the beginning of the comment and the
next semicolon.
Do not add an excessive number of comments. The Linux kernel limits the length of
character strings used in module options (usually to 1KB, but this may differ
between vendor kernels). If you exceed this limit, errors result and the specified
configuration may not be processed correctly.
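Before adding comments to a long ip2nets or routes string, it can help to check how close the string is to the length limit. The following Python sketch is an illustration only: the function names are hypothetical, and the 1024-byte limit is the usual value mentioned above, which may differ on your kernel.

```python
import re

MAX_OPTION_LEN = 1024  # assumed limit in bytes; vendor kernels may differ


def strip_comments(value):
    """Remove each comment, from '#' up to (not including) the next ';'."""
    return re.sub(r"#[^;]*", "", value)


def check_option_length(value, limit=MAX_OPTION_LEN):
    """Return (fits, length_in_bytes) for a module option string."""
    length = len(value.encode("utf-8"))
    return length <= limit, length


if __name__ == "__main__":
    option = "tcp0 192.168.0.* # TCP clients; o2ib0 132.6.1.* # IB nodes;"
    print(check_option_length(option))
    print(check_option_length(strip_comments(option)))
```

Comparing the two lengths shows how much of the limit is consumed by comments alone.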
Configuring Lustre
10.1 Configuring a Simple Lustre File System
A Lustre system can be set up in a variety of configurations by using the
administrative utilities provided with Lustre. The procedure below shows how to
configure a simple Lustre file system consisting of a combined MGS/MDS, one OSS
with two OSTs, and a client. For an overview of the entire Lustre installation
procedure, see Chapter 4: Installation Overview.
The following optional steps should also be completed, if needed, before the Lustre
software is configured:
■ Set up a hardware or software RAID on block devices to be used as OSTs or MDTs. For
information about setting up RAID, see the documentation for your RAID
controller or Chapter 6: Configuring Storage on a Lustre File System.
■ Set up network interface bonding on Ethernet interfaces. For information about setting
up network interface bonding, see Chapter 7: Setting Up Network Interface
Bonding.
■ Set lnet module parameters to specify how Lustre Networking (LNET) is to be
configured to work with Lustre and test the LNET configuration. LNET will, by default,
use the first TCP/IP interface it discovers on a system. If this network
configuration is sufficient, you do not need to configure LNET. LNET
configuration is required if you are using Infiniband or multiple Ethernet
interfaces.
For information about configuring LNET, see Chapter 9: Configuring Lustre
Networking (LNET). For information about testing LNET, see Chapter 23: Testing
Lustre Network Performance (LNET Self-Test).
■ Run the benchmark script sgpdd_survey to determine baseline performance of your
hardware. Benchmarking your hardware will simplify debugging performance
issues that are unrelated to Lustre and ensure you are getting the best possible
performance with your installation. For information about running
sgpdd_survey, see Section 24.2, “Testing I/O Performance of Raw Hardware
(sgpdd_survey)” on page 24-3.
Note – The sgpdd_survey script overwrites the device being tested so it must be
run before the OSTs are configured.
1. Create a combined MGS/MDT file system on a block device. On the MDS node,
run:
Note – If you plan to generate multiple file systems, the MGS should be created
separately on its own dedicated block device, by running:
2. Mount the combined MGS/MDT file system on the block device. On the MDS
node, run:
Note – If you have created an MGS and an MDT on separate block devices, mount
them both.
4. Mount the OST. On the OSS node where the OST was created, run:
5. Mount the Lustre file system on the client. On the client node, run:
6. Verify that the file system started and is working correctly. Do this by running
the lfs df, dd and ls commands on the client node.
Note – If you have a problem mounting the file system, check the syslogs on the
client and all the servers for errors and also check the network settings. A common
issue with newly-installed systems is that hosts.deny or firewall rules may prevent
connections on port 988.
Common Parameters   Value       Description
network type        TCP/IP      Network type used for Lustre file system temp
MGS/MDS node
mount point         /mnt/mdt    Mount point for the mdt1 block device (/dev/sdb) on the MGS/MDS node
OSS node            oss1        First OSS node in Lustre file system temp
block device        /dev/sdc    Block device for the first OSS node (oss1)
mount point         /mnt/ost1   Mount point for the ost1 block device (/dev/sdc) on the oss1 node
OSS node            oss2        Second OSS node in Lustre file system temp
block device        /dev/sdd    Block device for the second OSS node (oss2)
mount point         /mnt/ost2   Mount point for the ost2 block device (/dev/sdd) on the oss2 node
Client node
mount point         /lustre     Mount point for Lustre file system temp on the client1 node
1. Create a combined MGS/MDT file system on the block device. On the MDS
node, run:
[root@mds /]# mkfs.lustre --fsname=temp --mgs --mdt /dev/sdb
This command generates this output:
Permanent disk data:
Target: temp-MDTffff
Index: unassigned
Lustre FS: temp
Mount type: ldiskfs
Flags: 0x75
(MDT MGS needs_index first_time update )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups
2. Mount the combined MGS/MDT file system on the block device. On the MDS
node, run:
[root@mds /]# mount -t lustre /dev/sdb /mnt/mdt
This command generates this output:
Lustre: temp-MDT0000: new disk, initializing
Lustre: 3009:0:(lproc_mds.c:262:lprocfs_wr_group_upcall()) \
temp-MDT0000: group upcall set to /usr/sbin/l_getgroups
Lustre: temp-MDT0000.mdt: set parameter \
group_upcall=/usr/sbin/l_getgroups
Lustre: Server temp-MDT0000 on device /dev/sdb has started
b. Mount ost1 on the OSS on which it was created. On oss1 node, run:
[root@oss1 /]# mount -t lustre /dev/sdc /mnt/ost1
The command generates this output:
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0000: new disk, initializing
Lustre: Server temp-OST0000 on device /dev/sdb has started
Shortly afterwards, this output appears:
b. Mount ost2 on the OSS on which it was created. On oss2 node, run:
[root@oss2 /]# mount -t lustre /dev/sdd /mnt/ost2
The command generates this output:
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
Lustre: temp-OST0000: new disk, initializing
Lustre: Server temp-OST0000 on device /dev/sdb has started
Shortly afterwards, this output appears:
Lustre: temp-OST0000: received MDS connection from 10.2.0.1@tcp0
Lustre: MDS temp-MDT0000: temp-OST0000_UUID now active, resetting orphans
5. Mount the Lustre file system on the client. On the client node, run:
[root@client1 /]# mount -t lustre 10.2.0.1@tcp0:/temp /lustre
This command generates this output:
Lustre: Client temp-client has started
TABLE 10-1
File Layout
Parameter Default Description
Use the lfs setstripe command described in Section 18.3, “Setting the File
Layout/Striping Configuration (lfs setstripe)” on page 18-4 to change the file
layout configuration.
For examples using these utilities, see the topic Chapter 36: System Configuration
Utilities on the Lustre wiki.
The lfs utility is useful for configuring and querying a variety of options related to
files. For more information, see Section 32.1, “lfs” on page 32-2.
Note – Some sample scripts are included in the directory where Lustre is installed. If
you have installed the Lustre source code, the scripts are located in the
lustre/tests sub-directory. These scripts enable quick setup of some simple
standard Lustre configurations.
This chapter describes how to configure Lustre failover using the Heartbeat cluster
infrastructure daemon. It includes:
■ Creating a Failover Environment
■ Setting up High-Availability (HA) Software with Lustre
11.1 Creating a Failover Environment
Lustre provides failover mechanisms only at the file system level. No failover
functionality is provided for system-level components, such as node failure detection
or power control, as would typically be provided in a complete failover solution.
Additional tools are also needed to provide resource fencing, control and monitoring.
Shoot The Other Node In The HEAD (STONITH) is a set of power management tools
provided with the Linux-HA package. STONITH has native support for many power
control devices and is extensible. It uses expect scripts to automate control.
http://sourceforge.net/projects/powerman
https://computing.llnl.gov/linux/powerman.html
Part III provides information about tools and procedures to use to administer a
Lustre file system. You will find information in this section about:
■ Lustre Monitoring
■ Lustre Operations
■ Lustre Maintenance
■ Managing Lustre Networking (LNET)
■ Upgrading Lustre
■ Backing Up and Restoring a File System
■ Managing File Striping and Free Space
■ Managing the File System and I/O
■ Managing Failover
■ Configuring and Managing Quotas
■ Managing Lustre Security
Tip – The starting point for administering Lustre is to monitor all logs and console
logs for system health:
- Monitor logs on all servers and all clients.
- Invest in tools that allow you to condense logs from multiple systems.
- Use the logging resources provided by Linux.
CHAPTER 12
Lustre Monitoring
This chapter provides information on monitoring Lustre and includes the following
sections:
■ Lustre Changelogs
■ Lustre Monitoring Tool
■ CollectL
■ Other Monitoring Options
12.1 Lustre Changelogs
The changelogs feature records events that change the file system namespace or file
metadata. Changes such as file creation, deletion, renaming, attribute changes, etc.
are recorded with the target and parent file identifiers (FIDs), the name of the target,
and a timestamp. These records can be used for a variety of purposes:
■ Capture recent changes to feed into an archiving system.
■ Use changelog entries to exactly replicate changes in a file system mirror.
■ Set up "watch scripts" that take action on certain events or directories.
■ Maintain a rough audit trail (file/directory changes with timestamps, but no user
information).
Value Description
lctl changelog_register
Because changelog records take up space on the MDT, the system administrator
must register changelog users. The registrants specify which records they are "done
with", and the system purges up to the greatest common record.
Changelog entries are not purged beyond a registered user’s set point (see lfs
changelog_clear).
lfs changelog
To display the metadata changes on an MDT (the changelog records), run:
When all changelog users are done with records < X, the records are deleted.
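The purge rule can be illustrated with a short Python sketch. It is a model of the behavior described above, not Lustre code: the function name and the user IDs are hypothetical, and each value is assumed to be the highest record index that user has cleared.

```python
def common_cleared_index(cleared_by_user):
    """Return the highest record index that every registered user has cleared.

    Records beyond this index are still needed by at least one user.
    Returns None when no users are registered.
    """
    if not cleared_by_user:
        return None
    return min(cleared_by_user.values())


if __name__ == "__main__":
    users = {"cl1": 12, "cl2": 8}  # hypothetical changelog users
    print(common_cleared_index(users))
```

In this model, a single slow or forgotten user holds back purging for everyone, which is why unused changelog users should be deregistered.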
lctl changelog_deregister
To deregister (unregister) a changelog user, run:
rec#
operation_type(numerical/text)
timestamp
datestamp
flags
t=target_FID
p=parent_FID
target_name
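The field order listed above lends itself to simple parsing in a script. The Python sketch below is a minimal illustration: the class and function names are hypothetical, the sample record is made up for this example, and the sketch assumes that all eight fields are present and separated by whitespace.

```python
import re
from typing import NamedTuple


class ChangelogRecord(NamedTuple):
    rec: int
    op_code: int
    op_name: str
    timestamp: str
    datestamp: str
    flags: int
    target_fid: str
    parent_fid: str
    target_name: str


def parse_record(line):
    """Parse one changelog line using the field order listed above."""
    parts = line.strip().split(None, 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    rec, op, timestamp, datestamp, flags, target, parent, name = parts
    match = re.fullmatch(r"(\d+)([A-Za-z_]\w*)", op)
    if not match:
        raise ValueError(f"unrecognized operation type: {op!r}")
    if not target.startswith("t=") or not parent.startswith("p="):
        raise ValueError("missing t= or p= prefix")
    return ChangelogRecord(int(rec), int(match.group(1)), match.group(2),
                           timestamp, datestamp, int(flags, 16),
                           target[2:], parent[2:], name)


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    sample = ("2 02MKDIR 19:10:21.509659173 2010.03.24 0x0 "
              "t=[0x200000420:0x3:0x0] p=[0x61b4:0xca2c7dde:0x0] pics")
    print(parse_record(sample))
```

A "watch script" of the kind mentioned earlier in this section could filter the parsed records by op_name or target_name.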
For example:
The deregistration operation clears all changelog records for the specified user (cl1).
mdd.lustre-MDT0000.changelog_users=current index: 8
ID index
cl2 8
mdd.lustre-MDT0000.changelog_mask=
MARK CREAT MKDIR HLINK SLINK MKNOD UNLNK RMDIR RNMFM RNMTO OPEN CLOSE
IOCTL TRUNC SATTR XATTR HSM
mdd.lustre-MDT0000.changelog_mask=HLINK
$ mkdir /mnt/lustre/mydir/foo
$ cp /etc/hosts /mnt/lustre/mydir/foo/file
$ ln /mnt/lustre/mydir/foo/file /mnt/lustre/mydir/myhardlink
Only item types that are in the mask show up in the changelog.
http://code.google.com/p/lmt/
http://collectl.sourceforge.net
http://collectl.sourceforge.net/Tutorial-Lustre.html
Another option is to script a simple monitoring solution that looks at various reports
from ifconfig, as well as the procfs files generated by Lustre.
Lustre Operations
Once you have the Lustre file system up and running, you can use the procedures in
this section to perform these basic Lustre administration tasks:
■ Mounting by Label
■ Starting Lustre
■ Mounting a Server
■ Unmounting a Server
■ Specifying Failout/Failover Mode for OSTs
■ Handling Degraded OST RAID Arrays
■ Running Multiple Lustre File Systems
■ Setting and Retrieving Lustre Parameters
■ Specifying NIDs and Failover
■ Erasing a File System
■ Reclaiming Reserved Disk Space
■ Replacing an Existing OST or MDS
■ Identifying To Which Lustre File an OST Object Belongs
13.1 Mounting by Label
The file system name is limited to 8 characters. We have encoded the file system and
target information in the disk label, so you can mount by label. This allows system
administrators to move disks around without worrying about issues such as SCSI
disk reordering or getting the /dev/device wrong for a shared target. Soon, file
system naming will be made as fail-safe as possible. Currently, Linux disk labels are
limited to 16 characters. To identify the target within the file system, 8 characters are
reserved, leaving 8 characters for the file system name:
<fsname>-MDT0000 or <fsname>-OST0a19
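The label layout can be checked with a short Python sketch. It only reproduces the naming pattern shown above; the function name is hypothetical, and the four-digit lowercase hexadecimal index is an assumption based on the OST0a19 example.

```python
MAX_FSNAME_LEN = 8   # characters left for the file system name
MAX_LABEL_LEN = 16   # Linux disk label limit stated above


def target_label(fsname, kind, index):
    """Build a label such as testfs-MDT0000 or testfs-OST0a19."""
    if not 1 <= len(fsname) <= MAX_FSNAME_LEN:
        raise ValueError("file system name must be 1 to 8 characters")
    if kind not in ("MDT", "OST"):
        raise ValueError("kind must be 'MDT' or 'OST'")
    if not 0 <= index <= 0xFFFF:
        raise ValueError("index must fit in four hexadecimal digits")
    label = f"{fsname}-{kind}{index:04x}"
    if len(label) > MAX_LABEL_LEN:
        raise ValueError("label exceeds the disk label limit")
    return label


if __name__ == "__main__":
    print(target_label("testfs", "MDT", 0))
    print(target_label("testfs", "OST", 0x0A19))
```

The separator, target type, and index together take 8 characters, which is why the file system name is limited to the remaining 8.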
Although the file system name is internally limited to 8 characters, you can mount
the clients at any mount point, so file system users are not subjected to short names.
Here is an example:
Note – If an OST is added to a Lustre file system with a combined MGS/MDT, then
the startup order changes slightly; the MGS must be started first because the OST
needs to write its configuration data to it. In this scenario, the startup order is
MGS/MDT, then OSTs, then the clients.
mount -t lustre
In this example, the MDT, an OST (ost0) and file system (testfs) are mounted.
LABEL=testfs-MDT0000 /mnt/test/mdt lustre defaults,_netdev,noauto 0 0
In general, it is wise to specify noauto and let your high-availability (HA) package
manage when to mount the device. If you are not using failover, make sure that
networking has been started before mounting a Lustre server. RedHat, SuSE, Debian
(and perhaps others) use the _netdev flag to ensure that these disks are mounted
after the network is up.
Caution – Do not do this when the client and OSS are on the same node, as memory
pressure between the client and OSS can lead to deadlocks.
$ umount /mnt/test
Gracefully stopping a server with the umount command preserves the state of the
connected clients. The next time the server is started, it waits for clients to reconnect,
and then goes through the recovery procedure.
If the force (-f) flag is used, then the server evicts all clients and stops WITHOUT
recovery. Upon restart, the server does not wait for recovery. Any currently
connected clients receive I/O errors until they reconnect.
Note – If you are using loopback devices, use the -d flag. This flag cleans up loop
devices and can always be safely specified.
By default, the Lustre file system uses failover mode for OSTs. To specify failout
mode instead, run this command:
In this example, failout mode is specified for the OSTs on MGS uml1, file system
testfs.
Caution – Before running this command, unmount all OSTs that will be affected by
the change in the failover/failout mode.
Note – After initial file system configuration, use the tunefs.lustre utility to
change the failover/failout mode. For example, to set the failout mode, run:
A parameter for each OST, called degraded, specifies whether the OST is running in
degraded mode or not.
If the OST is remounted due to a reboot or other condition, the flag resets to 0.
Note – The MDT, OSTs and clients in the new file system must share the same name
(prepended to the device name). For example, for a new file system named foo, the
MDT and two OSTs would be named foo-MDT0000, foo-OST0000, and
foo-OST0001.
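The naming rule in the note above can be sketched as a small helper. The function name target_name is hypothetical, and the sketch assumes that the target index is written as four hexadecimal digits after the target type.

```shell
# Hypothetical helper: build a target name from the file system name,
# the target type (MDT or OST), and a numeric index.
# Assumption: the index is rendered as four hexadecimal digits.
target_name() {
    printf '%s-%s%04x\n' "$1" "$2" "$3"
}

target_name foo MDT 0   # foo-MDT0000
target_name foo OST 0   # foo-OST0000
target_name foo OST 1   # foo-OST0001
```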
For example, to mount a client on file system foo at mount point /mnt/lustre1,
run:
Note – If a client will be mounted on several file systems, add the following line
to the /etc/xattr.conf file to avoid problems when files are moved between the file
systems: lustre.* skip
Note – The MGS is universal; there is only one MGS per Lustre installation, not per
file system.
Note – There is only one file system per MDT. Therefore, specify --mdt --mgs on
one file system and --mdt --mgsnode=<MGS node NID> on the other file systems.
For more details about creating a file system, see Chapter 10: Configuring Lustre. For
more details about mkfs.lustre, see Chapter 36: System Configuration Utilities.
For more details about tunefs.lustre, see Chapter 36: System Configuration
Utilities.
Note – The lctl list_param command enables users to list all parameters that
can be set. See Section 13.8.3.3, “Listing Parameters” on page 13-11.
For more details about the lctl command, see the examples in the sections below
and Chapter 36: System Configuration Utilities.
For example:
osc.myth-OST0000-osc.max_dirty_mb=32
osc.myth-OST0001-osc.max_dirty_mb=32
osc.myth-OST0002-osc.max_dirty_mb=32
osc.myth-OST0003-osc.max_dirty_mb=32
osc.myth-OST0004-osc.max_dirty_mb=32
<obd|fsname>.<obdtype>.<proc_file_name>=<value>)
Caution – Parameters specified with the lctl conf_param command are set
permanently in the file system’s configuration file on the MGS.
The following arguments are available for the lctl list_param command.
-F Add '/', '@' or '=' for directories, symlinks and writeable files, respectively
For example:
osc.myth-OST0000-osc-ffff88006dd20000.filesfree=217623
osc.myth-OST0001-osc-ffff88006dd20000.filesfree=5075042
osc.myth-OST0002-osc-ffff88006dd20000.filesfree=3762034
osc.myth-OST0003-osc-ffff88006dd20000.filesfree=91052
osc.myth-OST0004-osc-ffff88006dd20000.filesfree=129651
lctl list_nids
This displays the server's NIDs (networks configured to work with Lustre).
This example has a combined MGS/MDT failover pair on uml1 and uml2, and an OST
failover pair on uml3 and uml4. There are corresponding Elan addresses on uml1 and
uml2.
Note – If you have an MGS or MDT configured for failover, perform these steps:
1. On the OST, list the NIDs of all MGS nodes at mkfs time.
$ "mkfs.lustre --reformat"
If you are using a separate MGS and want to keep other file systems defined on that
MGS, then set the writeconf flag on the MDT for that file system. The writeconf
flag causes the configuration logs to be erased; they are regenerated the next time the
servers start.
$ umount /mnt/lustre
2. To erase the file system and, presumably, replace it with another file system, run:
3. If you have a separate MGS (that you do not want to reformat), add the
"writeconf" flag to mkfs.lustre on the MDT. Run:
Note – If you have a combined MGS/MDT, reformatting the MDT reformats the
MGS as well, causing all configuration information to be lost; you can then start
building your new file system. Nothing needs to be done with old disks that will not
be part of the new file system; just do not mount them.
You do not need to shut down Lustre before running this command or restart it
afterwards.
1. On the OST (as root), run debugfs to display the file identifier (FID) of the file
associated with the object.
For example, if the object is 34976 on /dev/lustre/ost_test2, the debug command
is:
# debugfs -c -R "stat /O/0/d$((34976 %32))/34976" /dev/lustre/ost_test2
The command output is:
e2001100000000002543c18700000000a0880000000000000000000000000000
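The object path passed to debugfs in the example above is built from the object ID and its remainder modulo 32. The sketch below reproduces that arithmetic; the function name ost_object_path is hypothetical, and it assumes the object lives in group 0 (the /O/0 directory), as in the example.

```shell
# Sketch: derive the on-disk path used in the debugfs "stat" request.
# Assumption: group 0 objects are stored under /O/0/d<N>,
# where N is the object ID modulo 32, as in the command above.
ost_object_path() {
    objid=$1
    printf '/O/0/d%d/%d\n' "$((objid % 32))" "$objid"
}

ost_object_path 34976   # /O/0/d0/34976
```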
struct osd_inode_id {
3. On the MDT (as root), use debugfs to find the file associated with the inode.
Note – The debugfs ncheck command is a brute-force search that may take a long
time to complete.
Note – To find the Lustre file from a disk LBA, follow the steps listed in the
document at this URL: http://smartmontools.sourceforge.net/badblockhowto.html.
Then, follow the steps above to resolve the Lustre filename.
Lustre Maintenance
Once you have the Lustre file system up and running, you can use the procedures in
this section to perform these basic Lustre maintenance tasks:
■ Working with Inactive OSTs
■ Finding Nodes in the Lustre File System
■ Mounting a Server Without Lustre Service
■ Regenerating Lustre Configuration Logs
■ Changing a Server NID
■ Adding a New OST to a Lustre File System
■ Removing and Restoring OSTs
■ Aborting Recovery
■ Determining Which Machine is Serving an OST
■ Changing the Address of a Failover Node
14.1 Working with Inactive OSTs
To mount a client or an MDT with one or more inactive OSTs, run commands similar
to this:
To activate an inactive OST on a live client or MDT, use the lctl activate
command on the OSC device. For example:
To get a list of all Lustre nodes, run this command on the MGS:
# cat /proc/fs/lustre/mgs/MGS/live/*
# cat /proc/fs/lustre/lov/<fsname>-mdtlov/target_obd
In this example, the combined MGS/MDT is testfs-MDT0000 and the mount point
is /mnt/test/mdt.
The writeconf command is destructive to some configuration items (i.e., OST pools
information and items set via conf_param), and should be used with caution. To
avoid problems:
■ Shut down the file system before running the writeconf command
■ Run the writeconf command on all servers (MDT first, then OSTs)
■ Start the file system in this order:
■ MGS (or the combined MGS/MDT)
■ MDT
■ OSTs
■ Lustre clients
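The ordering rule above (writeconf on the MDT first, then on the OSTs) can be sketched as a dry-run helper that only prints the commands. The function name writeconf_plan and the device paths are hypothetical placeholders; substitute the real devices and run the printed commands only after the file system has been shut down.

```shell
# Dry-run sketch: print (do not execute) the writeconf sequence.
# $1 = MDT device, remaining arguments = OST devices.
# All device paths used below are hypothetical.
writeconf_plan() {
    mdt_dev=$1
    shift
    echo "tunefs.lustre --writeconf $mdt_dev"      # MDT first
    for ost_dev in "$@"; do
        echo "tunefs.lustre --writeconf $ost_dev"  # then each OST
    done
}

writeconf_plan /dev/mdt_dev /dev/ost0_dev /dev/ost1_dev
```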
Caution – The OST pools feature enables a group of OSTs to be named for file
striping purposes. If you use OST pools, be aware that running the writeconf
command erases all pools information (as well as any other parameters set via lctl
conf_param). We recommend that the pools definitions (and conf_param settings)
be executed via a script, so they can be reproduced easily after a writeconf is
performed.
c. If the NID on the MGS was changed, communicate the new MGS location to
each server. Run:
tunefs.lustre --erase-param --mgsnode=<new_nid(s)> --writeconf /dev/..
You may want to remove (deactivate) an OST and prevent new files from being
written to it in several situations:
■ Hard drive has failed and a RAID resync/rebuild is underway
■ OST is nearing its space capacity
You may want to deactivate an OST and prevent new files from being written to it in
several situations:
■ OST is nearing its space capacity
■ Hard drive has failed and a RAID resync/rebuild is underway
■ OST storage has failed permanently
When removing an OST, remember that the MDT does not communicate directly
with OSTs. Rather, each OST has a corresponding OSC which communicates with the
MDT. It is necessary to determine the device number of the OSC that corresponds to
the OST. Then, you use this device number to deactivate the OSC on the MDT.
1. For the OST to be removed, determine the device number of the corresponding
OSC on the MDT.
a. List all OSCs on the node, along with their device numbers. Run:
b. Determine the device number of the OSC that corresponds to the OST to be
removed.
Note – Do not deactivate the OST on the clients. Doing so causes errors (EIOs) and
causes the copy out to fail.
3. Discover all files that have objects residing on the deactivated OST.
Depending on whether the deactivated OST is available or not, the data from that
OST may be migrated to other OSTs, or may need to be restored from backup.
a. If the OST is still online and available, find all files with objects on the deactivated
OST, and copy them to other OSTs in the file system to:
Note – This setting is only temporary and will be reset if the clients or MDS are
rebooted. It needs to be run on all clients.
Note – A removed OST still appears in the file system; do not create a new OST with
the same name.
To replace an OST that was removed from service due to corruption or hardware
failure, the file system needs to be formatted for Lustre, and the Lustre configuration
should be restored, if available.
If the OST configuration files were not backed up, due to the OST file system being
completely inaccessible, it is still possible to replace the failed OST with a new one at
the same OST index.
If the OST was temporarily deactivated, it needs to be reactivated on the MDS and
clients.
Note – The recovery process is blocked until all OSTs are available.
To identify the NID that is serving a specific OST, run one of the following
commands on a client (you do not need to be a root user):
For example:
- OR -
This chapter describes some tools for managing Lustre Networking (LNET) and
includes the following sections:
■ Updating the Health Status of a Peer or Router
■ Starting and Stopping LNET
■ Multi-Rail Configurations with LNET
■ Load Balancing with InfiniBand
15.1 Updating the Health Status of a Peer or
Router
There are two mechanisms to update the health status of a peer or a router:
■ LNET can actively check health status of all routers and mark them as dead or
alive automatically. By default, this is off. To enable it, set auto_down and, if
desired, check_routers_before_use. This initial check may cause a pause
equal to router_ping_timeout at system startup, if there are dead routers in
the system.
■ When there is a communication error, all LNDs notify LNET that the peer (not
necessarily a router) is down. This mechanism is always on, and there is no
parameter to turn it off. However, if you set the LNET module parameter
auto_down to 0, LNET ignores all such peer-down notifications.
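As a rough sketch, the two module parameters named above can be written as a single options line for the lnet module. The helper name lnet_router_check_options is hypothetical, the values are illustrative, and the file that holds the line (for example, the modules.conf file mentioned in this chapter) depends on the distribution.

```shell
# Hypothetical helper: print an "options lnet" line for the two
# parameters named above. $1 = auto_down, $2 = check_routers_before_use.
lnet_router_check_options() {
    printf 'options lnet auto_down=%s check_routers_before_use=%s\n' "$1" "$2"
}

# Illustrative values only: enable both settings.
lnet_router_check_options 1 1
```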
$ modprobe lnet
$ lctl network up
$ lctl list_nids
This command tells you the network(s) configured to work with Lustre.
If the networks are not correctly set up, see the modules.conf "networks=" line and
make sure the network layer modules are correctly installed and configured.
This command takes the "best" NID from a list of the NIDs of a remote host. The
"best" NID is the one that the local node uses when trying to communicate with the
remote node.
Note – Attempting to remove Lustre modules prior to stopping the network may
result in a crash or an LNET hang. If this occurs, the node must be rebooted (in most
cases). Make sure that the Lustre network and Lustre are stopped prior to unloading
the modules. Be extremely careful using rmmod -f.
1. Multi-rail configurations are only supported by o2iblnd; other IB LNDs do not support multiple interfaces.
2. Run the modprobe lnet command and create a combined MGS/MDT file
system.
The following commands create the MGS/MDT file system and mount the servers
(MGS/MDT and OSS).
modprobe lnet
modprobe lnet
mount -t lustre
192.168.10.101@o2ib0,192.168.10.102@o2ib1:/mds/client /mnt/lustre
As an example, consider a two-rail IB cluster running the OFA stack (OFED) with
these IPoIB address assignments.
          ib0                  ib1
Servers   192.168.0.*          192.168.1.*
Clients   192.168.[2-127].*    192.168.[128-253].*
This configuration gives every server two NIDs, one on each network, and statically
load-balances clients between the rails.
■ A single client that must get two rails of bandwidth, and it does not matter if the
maximum aggregate bandwidth is only (# servers) * (1 rail).
ip2nets=" o2ib0(ib0) 192.168.[0-1].[0-252/2] #even servers;\
o2ib1(ib1) 192.168.[0-1].[1-253/2] #odd servers;\
o2ib0(ib0),o2ib1(ib1) 192.168.[2-253].* #clients"
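To make the three ip2nets patterns above easier to read, the sketch below reports which network(s) a given address would match. The helper nets_for_ip is hypothetical: it mirrors the patterns by hand, assumes the address starts with 192.168, and is not how LNET itself parses ip2nets.

```shell
# Sketch: map an address to the network(s) selected by the rules above.
# Assumption: the address has the form 192.168.<third>.<fourth>.
nets_for_ip() {
    third=$(echo "$1" | cut -d. -f3)
    fourth=$(echo "$1" | cut -d. -f4)
    if [ "$third" -le 1 ]; then
        # Servers: [0-252/2] matches even hosts, [1-253/2] matches odd hosts.
        if [ "$fourth" -gt 253 ]; then
            echo "no match"
        elif [ $((fourth % 2)) -eq 0 ]; then
            echo "o2ib0(ib0)"
        else
            echo "o2ib1(ib1)"
        fi
    elif [ "$third" -le 253 ]; then
        # Clients: 192.168.[2-253].* are placed on both networks.
        echo "o2ib0(ib0),o2ib1(ib1)"
    else
        echo "no match"
    fi
}

nets_for_ip 192.168.0.10   # even server
nets_for_ip 192.168.1.7    # odd server
nets_for_ip 192.168.40.3   # client
```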
This configuration includes two additional proxy o2ib networks to work around
Lustre's simplistic NID selection algorithm. It connects "even" clients to "even"
servers with o2ib0 on rail0, and "odd" servers with o2ib3 on rail1. Similarly, it
connects "odd" clients to "odd" servers with o2ib1 on rail0, and "even" servers with
o2ib2 on rail1.
Upgrading Lustre
This chapter describes Lustre interoperability and how to upgrade to Lustre 2.0, and
includes the following sections:
■ Lustre Interoperability
■ Upgrading Lustre 1.8.x to 2.0
16.1 Lustre Interoperability
Lustre 2.0 is built on a new architectural code base, which is different than the one
used with Lustre 1.8. These architectural changes require existing Lustre 1.8.x users
to follow a slightly different procedure to upgrade to Lustre 2.0: clients must
be unmounted and the file system shut down. Once the servers are upgraded and
restarted, then the clients can be remounted. After the upgrade, Lustre 2.0 servers
can interoperate with compatible 1.8 clients and servers. Lustre 2.0 does not support
2.0 clients interoperating with 1.8 servers.
Note – Lustre 1.8 clients support a mix of 1.8 and 2.0 OSTs, so not all OSSs need to
be upgraded at the same time.
Note – Lustre 2.0 is compatible with version 1.8.4 and above. If you are planning a
heterogeneous environment (mixed 1.8 and 2.0 servers), make sure that version 1.8.4
is installed on the client and server nodes that are not upgraded to 2.0.
Note – Although the Lustre 1.8 to 2.0 upgrade path has been tested, for best results
we recommend performing a fresh Lustre 2.0 installation, rather than upgrading
from 1.8 to 2.0.
Tip – In a Lustre upgrade, the package install and file system unmount steps are
reversible; you can do either step first. To minimize downtime, this procedure first
performs the 2.0 package installation, and then unmounts the file system.
2. If any Lustre nodes will not be upgraded to 2.0, make sure that these client and
server nodes are at version 1.8.4.
Lustre 2.0 is compatible with version 1.8.4 and above. If you are planning a
heterogeneous environment (mixed 1.8 and 2.0 clients and servers), make sure that
version 1.8.4 is installed on nodes that are not upgraded to 2.0.
3. Install the 2.0 packages on the Lustre servers and, optionally, the clients.
Some or all servers can be upgraded. Some or all clients can be upgraded.
For help determining where to install a specific package, see TABLE 8-1 (Lustre
packages, descriptions and installation guidance).
$ rpm -ivh \
kernel-lustre-smp-<ver> \
kernel-ib-<ver> \
lustre-modules-<ver> \
lustre-ldiskfs-<ver>
c. Unmount the OSTs (be sure to unmount all OSTs). On each OSS node, run:
5. Unload the old Lustre modules by rebooting the node or manually removing
the Lustre modules.
Run lustre_rmmod several times and use lsmod to check the currently loaded
modules.
a. Mount the OSTs (be sure to mount all OSTs). On each OSS node, run:
c. Mount the file system on the clients. On each client node, run:
If you have a problem upgrading Lustre, contact us via the Bugzilla bug tracker.
Lustre provides backups at the file system-level, device-level and file-level. This
chapter describes how to back up and restore on Lustre, and includes the following
sections:
■ Backing up a File System
■ Backing Up and Restoring an MDS or OST (Device Level)
■ Making a File-Level Backup of an OST File System
■ Restoring a File-Level Backup
■ Using LVM Snapshots with Lustre
17.1 Backing up a File System
Backing up a complete file system gives you full control over the files to back up, and
allows restoration of individual files as needed. File system-level backups are also the
easiest to integrate into existing backup solutions.
File system backups are performed from a Lustre client (or many clients working in
parallel in different directories) rather than on individual server nodes; this is no
different than backing up any other file system.
However, due to the large size of most Lustre file systems, it is not always possible to
get a complete backup. We recommend that you back up subsets of a file system.
This includes subdirectories of the entire file system, filesets for a single user, files
incremented by date, and so on.
Note – In order to allow Lustre to scale the filesystem namespace for future
applications, Lustre 2.x internally uses a 128-bit file identifier for all files. To interface
with user applications, Lustre presents 64-bit inode numbers for the stat(), fstat(), and
readdir() system calls on 64-bit applications, and 32-bit inode numbers to 32-bit
applications.
Some 32-bit applications accessing Lustre filesystems (on both 32-bit and 64-bit
CPUs) may experience problems with the stat(), fstat() or readdir() system calls under
certain circumstances, though the Lustre client should return 32-bit inode numbers to
these applications.
In particular, if the Lustre filesystem is exported from a 64-bit client via NFS to a
32-bit client, the Linux NFS server will export 64-bit inode numbers to applications
running on the NFS client. If the 32-bit applications are not compiled with Large File
Support (LFS), then they return EOVERFLOW errors when accessing the Lustre files.
To avoid this problem, Linux NFS clients can use the kernel command-line option
"nfs.enable_ino64=0" in order to force the NFS client to export 32-bit inode numbers
to the client.
Workaround: We very strongly recommend that backups using tar(1) and other
utilities that depend on the inode number to uniquely identify an inode be run on
64-bit clients. The 128-bit Lustre file identifiers cannot be uniquely mapped to a 32-bit
inode number, and as a result these utilities may operate incorrectly on 32-bit clients.
The first time that lustre_rsync is run, the user must specify a set of parameters
for the program to use. These parameters are described in the following table and in
Section 36.13, “lustre_rsync” on page 36-23. On subsequent runs, these parameters
are stored in the status file, and only the name of the status file needs to be
passed to lustre_rsync.
- AND -
■ Verify that the Lustre file system (source) and the replica file system (target) are
identical before registering the changelog user. If the file systems are discrepant,
use a utility, e.g. regular rsync (not lustre_rsync), to make them identical.
Parameter           Description
--source=<src>      The path to the root of the Lustre file system (source) which will be
                    synchronized. This is a mandatory option if a valid status log created
                    during a previous synchronization operation (--statuslog) is not
                    specified.
--target=<tgt>      The path to the root where the source file system will be
                    synchronized (target). This is a mandatory option if the status log
                    created during a previous synchronization operation (--statuslog)
                    is not specified. This option can be repeated if multiple
                    synchronization targets are desired.
--mdt=<mdt>         The metadata device to be synchronized. A changelog user must be
                    registered for this device. This is a mandatory option if a valid status
                    log created during a previous synchronization operation
                    (--statuslog) is not specified.
--user=<user id>    The changelog user ID for the specified MDT. To use lustre_rsync,
                    the changelog user must be registered. For details, see the
                    changelog_register parameter in Section 36.3, “lctl” on
                    page 36-4. This is a mandatory option if a valid status log created
                    during a previous synchronization operation (--statuslog) is not
                    specified.
--statuslog=<log>   A log file to which synchronization status is saved. When the
                    lustre_rsync utility starts, if the status log from a previous
                    synchronization operation is specified, then the state is read from the
                    log and otherwise mandatory --source, --target and --mdt
                    options can be skipped. Specifying the --source, --target and/or
                    --mdt options, in addition to the --statuslog option, causes the
                    specified parameters in the status log to be overridden. Command line
                    options take precedence over options in the status log.
--xattr <yes|no>    Specifies whether extended attributes (xattrs) are synchronized or
                    not. The default is to synchronize extended attributes.
                    Note - Disabling xattrs causes Lustre striping information not to be
                    synchronized.
--verbose           Produces verbose output.
--dry-run           Shows the output of lustre_rsync commands (copy, mkdir, etc.)
                    on the target file system without actually executing them.
--abort-on-err      Stops processing the lustre_rsync operation if an error occurs.
                    The default is to continue the operation.
After the file system undergoes changes, synchronize the changes onto the target file
system. Only the statuslog name needs to be specified, as it has all the parameters
passed earlier.
$ lustre_rsync --source=/mnt/lustre \
--target=/mnt/target1 --target=/mnt/target2 \
--mdt=lustre-MDT0000 --user=cl1 \
--statuslog sync.log
Note – Keeping an updated full backup of the MDT is especially important because
a permanent failure of the MDT file system renders the much larger amount of data
in all the OSTs largely inaccessible and unusable.
Note – In Lustre 2.0 and 2.1 the only correct way to perform an MDT backup and
restore is to do a device-level backup as is described in this section. The ability to do
MDT file-level backups is not functional in these releases because of the inability to
restore the Object Index (OI) file correctly (see bug 22741 for details).
If hardware replacement is the reason for the backup or if a spare storage device is
available, it is possible to do a raw copy of the MDT or OST from one block device to
the other, as long as the new device is at least as large as the original device. To do
this, run:
If hardware errors cause read problems on the original device, use the command
below to allow as much data as possible to be read from the original device while
skipping sections of the disk with errors:
Even in the face of hardware errors, the ldiskfs file system is very robust and it may
be possible to recover the file system data after running e2fsck -f on the new
device.
[oss]# cd /mnt/ost
Note – If the tar(1) command supports the --xattr option, the getfattr step may
be unnecessary as long as it does a backup of the "trusted" attributes. However,
completing this step is not harmful and can serve as an added safety measure.
Note – In most distributions, the getfattr command is part of the "attr" package.
If the getfattr command returns errors like Operation not supported, then
the kernel does not correctly support EAs. Stop and use a different backup method.
Note – In Lustre 1.6.7 and later, the --sparse option reduces the size of the backup
file. Be sure to use it so the tar command does not mistakenly create an archive full
of zeros.
[oss]# cd -
[oss]# cd /mnt/ost
[oss]# cd -
If the file system was used between the time the backup was made and when it was
restored, then the lfsck tool (part of Lustre e2fsprogs) can optionally be run to ensure
the file system is coherent. If all of the device file systems were backed up at the
same time after the entire Lustre file system was stopped, this is not necessary. In
either case, the file system should be immediately usable even if lfsck is not run,
though there may be I/O errors reading from files that are present on the MDT but
not the OSTs, and files that were created after the MDT backup will not be
accessible/visible.
Because LVM snapshots cost CPU cycles as new files are written, taking snapshots of
the main Lustre file system will probably result in unacceptable performance losses.
You should create a new, backup Lustre file system and periodically (e.g., nightly)
back up new/changed files to it. Periodic snapshots can be taken of this backup file
system to create a series of "full" backups.
cfs21:~# ls /mnt/main
fstab passwd
You can create as many snapshots as you have room for in the volume group. If
necessary, you can dynamically add disks to the volume group.
The snapshots of the target MDT and OSTs should be taken at the same point in time.
Make sure that the cronjob updating the backup file system is not running, since that
is the only thing writing to the disks. Here is an example:
cfs21:~/cfs/b1_5/lustre/utils# ls /mnt/back
fstab passwds
lvremove /dev/volgroup/MDTb1
This chapter describes file striping and I/O options, and includes the following
sections:
■ How Lustre Striping Works
■ Lustre File Striping Considerations
■ Setting the File Layout/Striping Configuration (lfs setstripe)
■ Retrieving File Layout/Striping Information (getstripe)
■ Managing Free Space
18.1 How Lustre Striping Works
Lustre uses a round-robin algorithm for selecting the next OST to which a stripe is to
be written. Normally the usage of OSTs is well balanced. However, if users create a
small number of exceptionally large files or incorrectly specify striping parameters,
imbalanced OST usage may result.
The MDS allocates objects on sequential OSTs. Periodically, it will adjust the striping
layout to eliminate some degenerate cases where applications that create very
regular file layouts (striping patterns) would preferentially use a particular OST in
the sequence.
Stripes are written to sequential OSTs until free space across the OSTs differs by
more than 20%. The MDS will then use weighted random allocations with a
preference for allocating objects on OSTs with more free space. This can reduce I/O
performance until space usage is rebalanced to within 20% again.
For a more detailed description of stripe assignments, see Section 18.5, “Managing
Free Space” on page 18-9.
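The switch between sequential and weighted allocation described above can be illustrated with a small check on per-OST free space. This is only a sketch: the helper allocation_mode is hypothetical, and it assumes that "differs by more than 20%" means the gap between the most-free and least-free OST exceeds 20% of the most-free value. The MDS may compute the imbalance differently.

```shell
# Sketch: decide which allocation mode the 20% rule above would select.
# Arguments: free space of each OST, in any consistent integer unit.
# Assumption: imbalance = (max_free - min_free) / max_free.
allocation_mode() {
    min=$1
    max=$1
    for free in "$@"; do
        if [ "$free" -lt "$min" ]; then min=$free; fi
        if [ "$free" -gt "$max" ]; then max=$free; fi
    done
    if [ $(( (max - min) * 100 )) -gt $(( max * 20 )) ]; then
        echo "weighted"
    else
        echo "round-robin"
    fi
}

allocation_mode 100 95 90   # round-robin
allocation_mode 100 70      # weighted
```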
stripe_size
The stripe size indicates how much data to write to one OST before moving to the
next OST. The default stripe_size is 1 MB, and passing a stripe_size of 0
causes the default stripe size to be used. Otherwise, the stripe_size value must be
a multiple of 64 KB.
stripe_count
The stripe count indicates how many OSTs to use. The default stripe_count value
is 1. Setting stripe_count to 0 causes the default stripe count to be used. Setting
stripe_count to -1 means stripe over all available OSTs (full OSTs are skipped).
start_ost
The start OST is the first OST to which files are written. The default value for
start_ost is -1 , which allows the MDS to choose the starting index. This setting
is strongly recommended, as it allows space and load balancing to be done by the
MDS as needed. Otherwise, the file starts on the specified OST index. The numbering
of the OSTs starts at 0.
pool_name
Specify the OST pool on which the file will be written. This allows limiting the OSTs
used to a subset of all OSTs in the file system. For more details about using OST
pools, see Section 19.2, “Creating and Managing OST Pools” on page 19-6.
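The stripe_size rule above (0 selects the default; any other value must be a multiple of 64 KB) can be checked before a layout is requested. The helper valid_stripe_size is a hypothetical name, and the sketch assumes the size is given in bytes.

```shell
# Sketch: validate a stripe size given in bytes.
# 0 means "use the default"; otherwise the value must be a
# multiple of 64 KB (65536 bytes).
valid_stripe_size() {
    size=$1
    [ "$size" -ge 0 ] && [ $((size % 65536)) -eq 0 ]
}

valid_stripe_size 4194304 && echo "4 MB: accepted"
valid_stripe_size 100000  || echo "100000 bytes: not a multiple of 64 KB"
```

A check like this could run before the stripe size is passed to lfs setstripe, so that an invalid value is caught early.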
This example command creates the new file /mnt/lustre/new_file with a stripe
size of 4 MB.
Now, when a file is created, the new stripe setting evenly distributes the data over all
the available OSTs:
The example below indicates that the file full_stripe is striped over all six active
OSTs in the configuration:
This is in contrast to the output in Section 18.3.1.1, “Setting the Stripe Size” on
page 18-5 which shows only a single object for the file.
To change the striping pattern (file layout) for a sub-directory, create a directory with
desired file layout as described above. Sub-directories inherit the file layout of the
root/parent directory.
Note – Striping of new files and sub-directories is done per the striping parameter
settings of the root directory. Once you set striping on the root directory, then, by
default, it applies to any new child directories created in that root directory (unless
they have their own striping settings).
OBDS:
0: home-OST0000_UUID ACTIVE
[...]
bob
obdidx objid objid group
0 33459243 0x1fe8c2b 0
/mnt/lustre
(Default) stripe_count: 1 stripe_size: 1M stripe_offset: -1
In this example, the default stripe count is 1 (data blocks are striped over a single
OST), the default stripe size is 1 MB, and objects are created over all available OSTs.
You can also use ls -l /proc/<pid>/fd/ to find open files using Lustre. For
example:
/mnt/lustre/foo
obdidx objid objid group
2 835487 0xcbf9f 0
osc.lustre-OST0002-osc.ost_conn_uuid=192.168.20.1@tcp
Option Description
Note – The df -i and lfs df -i commands show the minimum number of inodes
that can be created in the file system. Depending on the configuration, it may be
possible to create more inodes than initially reported by df -i. Later, df -i
operations will show the current, estimated free inode count.
If the underlying file system has fewer free blocks than inodes, then the total inode
count for the file system reports only as many inodes as there are free blocks. This is
done because Lustre may need to store an external attribute for each new inode, and
it is better to report a free inode count that is the guaranteed, minimum number of
inodes that can be created.
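The reporting rule above amounts to taking the smaller of the free inode count and the free block count. The sketch below shows that calculation; the function name reported_free_inodes is hypothetical and the numbers are illustrative.

```shell
# Sketch: the guaranteed minimum number of inodes that can be created
# is limited by whichever is smaller, free inodes or free blocks.
reported_free_inodes() {
    free_inodes=$1
    free_blocks=$2
    if [ "$free_blocks" -lt "$free_inodes" ]; then
        echo "$free_blocks"
    else
        echo "$free_inodes"
    fi
}

reported_free_inodes 725097 500000   # 500000 (limited by free blocks)
reported_free_inodes 725097 900000   # 725097 (limited by free inodes)
```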
[lin-cli1] $ lfs df -h
UUID bytes Used Available Use% Mounted on
mds-lustre-0_UUID 8.7G 996.1M 7.8G 11% /mnt/lustre[MDT:0]
ost-lustre-0_UUID 89.8G 53.7G 36.1G 59% /mnt/lustre[OST:0]
ost-lustre-1_UUID 89.8G 53.8G 36.0G 59% /mnt/lustre[OST:1]
ost-lustre-2_UUID 89.8G 51.8G 38.0G 57% /mnt/lustre[OST:2]
filesystem summary: 269.5G 159.3G 110.1G 59% /mnt/lustre
[lin-cli1] $ lfs df -i
UUID Inodes IUsed IFree IUse% Mounted on
mds-lustre-0_UUID 2211572 41924 2169648 1% /mnt/lustre[MDT:0]
ost-lustre-0_UUID 737280 12183 725097 1% /mnt/lustre[OST:0]
ost-lustre-1_UUID 737280 12232 725048 1% /mnt/lustre[OST:1]
ost-lustre-2_UUID 737280 12214 725066 1% /mnt/lustre[OST:2]
filesystem summary: 2211572 41924 2169648 1% /mnt/lustre[OST:2]
Increasing this value puts more weighting on free space. When the free space priority
is set to 100%, then location is no longer used in stripe-ordering calculations and
weighting is based entirely on free space.
Note – Setting the priority to 100% means that OSS distribution does not count in
the weighting, but the stripe assignment is still done via a weighting. For example, if
OST2 has twice as much free space as OST1, then OST2 is twice as likely to be used,
but it is not guaranteed to be used.
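The example in the note above can be made concrete by computing relative selection chances from free space. This is a simplified sketch with a hypothetical helper name, ost_weights; it assumes the chance of picking an OST is directly proportional to its free space and does not model any other factor the allocator may apply.

```shell
# Sketch: relative selection chance per OST at 100% free-space priority.
# Arguments: free space of each OST, in any consistent integer unit.
# Integer division truncates, so the percentages may not sum to 100.
ost_weights() {
    total=0
    for free in "$@"; do
        total=$((total + free))
    done
    for free in "$@"; do
        echo "$((100 * free / total))%"
    done
}

# OST1 has 100 units free, OST2 has 200 units free.
ost_weights 100 200
```

With these inputs the second value is twice the first, which matches the "twice as likely" wording above; the choice is still random, so either OST may be selected on any given allocation.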
This chapter describes file striping and I/O options, and includes the following
sections:
■ Handling Full OSTs
■ Creating and Managing OST Pools
■ Adding an OST to a Lustre File System
■ Performing Direct I/O
■ Other I/O Options
19.1 Handling Full OSTs
Sometimes a Lustre file system becomes unbalanced, often due to
incorrectly specified stripe settings, or when very large files are created that are not
striped over all of the OSTs. If an OST is full and an attempt is made to write more
information to the file system, an error occurs. The procedures below describe how to
handle a full OST.
The MDS will normally handle space balancing automatically at file creation time,
and this procedure is normally not needed, but may be desirable in certain
circumstances (e.g. when creating very large files that would consume more than the
total free space of the full OSTs).
In this case, OST:2 is almost full and when an attempt is made to write additional
information to the file system (even with uniform striping over all the OSTs), the
write command fails as follows:
2. Use the lctl dl command to show the status of all file system components:
2. To move single object(s), create a new copy and remove the original. Enter:
[client]# lfs df -h
UUID bytes Used Available Use% Mounted on
lustre-MDT0000_UUID 4.4G 214.5M 3.9G 4% /mnt/lustre[MDT:0]
lustre-OST0000_UUID 2.0G 1.3G 598.1M 65% /mnt/lustre[OST:0]
lustre-OST0001_UUID 2.0G 1.3G 594.1M 65% /mnt/lustre[OST:1]
lustre-OST0002_UUID 2.0G 913.4M 1000.0M 45% /mnt/lustre[OST:2]
lustre-OST0003_UUID 2.0G 1.3G 602.1M 65% /mnt/lustre[OST:3]
lustre-OST0004_UUID 2.0G 1.3G 606.1M 64% /mnt/lustre[OST:4]
lustre-OST0005_UUID 2.0G 1.3G 610.1M 64% /mnt/lustre[OST:5]
When an OST pool is defined, it can be used to allocate files. When file or directory
striping is set to a pool, only OSTs in the pool are candidates for striping. If a
stripe_index is specified which refers to an OST that is not a member of the pool,
an error is returned.
OST pools are used only at file creation. If the definition of a pool changes (an OST is
added or removed or the pool is destroyed), already-created files are not affected.
Note – An error (EINVAL) results if you create a file using an empty pool.
Note – If a directory has pool striping set and the pool is subsequently removed, the
new files created in this directory have the (non-pool) default striping pattern for that
directory applied and no error is returned.
The lctl command MUST be run on the MGS. Another requirement for managing
OST pools is to either have the MDT and MGS on the same node or have a Lustre
client mounted on the MGS node, if it is separate from the MDS. This is needed to
validate that the pool commands being run are correct.
Caution – Running the writeconf command on the MDS erases all pool
information (as well as any other parameters set using lctl conf_param). We
recommend that the pool definitions (and conf_param settings) be executed using
a script, so they can be reproduced easily after a writeconf is performed.
Where:
■ <ost_list> is [<fsname>-]OST<index_range>[_UUID]
■ <index_range> is <ost_index_start>-<ost_index_end>[,<index_range>]
or <ost_index_start>-<ost_index_end>/<step>
If the leading <fsname> and/or ending _UUID are missing, they are automatically
added.
For example, to add even-numbered OSTs to pool1 on file system lustre, run a
single command (add) to add many OSTs to the pool at one time:
lctl pool_add lustre.pool1 OST[0-10/2]
Note – Each time an OST is added to a pool, a new llog configuration record is
created. For convenience, you can run a single command.
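To see which OSTs an index range selects, the <index_range> grammar given above can be expanded by hand or with a few lines of code. The following is only a sketch: expand_index_range is a hypothetical helper, it handles only the documented start-end and start-end/step forms, and it assumes the indices are written in decimal.

```python
def expand_index_range(spec):
    """Expand e.g. "0-10/2" or "0-2,8-9" into a list of OST indices.

    Only the documented forms are handled: <start>-<end>, optionally followed
    by /<step>, with several ranges separated by commas.
    """
    indices = []
    for part in spec.split(","):
        bounds, _, step_text = part.partition("/")
        start_text, _, end_text = bounds.partition("-")
        start, end = int(start_text), int(end_text)
        step = int(step_text) if step_text else 1
        # The end index is inclusive, as in OST[0-10/2].
        indices.extend(range(start, end + 1, step))
    return indices


if __name__ == "__main__":
    print(expand_index_range("0-10/2"))
```

For the range in the pool_add example, 0-10/2, this yields the even indices from 0 through 10.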
Note – All OSTs must be removed from a pool before it can be destroyed.
To associate a directory with a pool, so all new files and directories will be created in
the pool, run:
Note – If you specify striping with an invalid pool name, because the pool does not
exist or the pool name was mistyped, lfs setstripe returns an error. Run lfs
pool_list to make sure the pool exists and the pool name is entered correctly.
Note – The --pool option for lfs setstripe is compatible with other modifiers.
For example, you can set striping on a directory to use an explicit starting index.
Applications using the read() and write() calls must supply buffers aligned on a page
boundary (usually 4 K). If the alignment is not correct, the call returns -EINVAL.
Direct I/O may help performance in cases where the client is doing a large amount of
I/O and is CPU-bound (CPU utilization 100%).
chattr +i <file>
If this happens, the client will re-read or re-write the affected data up to five times to
get a good copy of the data over the network. If it is still not possible, then an I/O
error is returned to the application.
$ cat /proc/fs/lustre/osc/<fsname>-OST<index>-osc-*/checksum_type
Note – The in-memory checksum always uses the adler32 algorithm, if available,
and only falls back to crc32 if adler32 cannot be used.
$ cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff81012b2c48e0/checksum_type
crc32 [adler]
$ cat /proc/fs/lustre/osc/lustre-OST0000-osc-ffff81012b2c48e0/checksum_type
[crc32] adler
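A monitoring script may want to pick the selected algorithm out of the checksum_type text. The sketch below assumes that the square brackets mark the algorithm currently in use, as the two sample outputs suggest; active_checksum is a hypothetical helper name.

```python
import re


def active_checksum(text):
    """Return the bracketed algorithm name from checksum_type text, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for sample in ("crc32 [adler]", "[crc32] adler"):
        print(sample, "->", active_checksum(sample))
```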
Managing Failover
This chapter describes failover in a Lustre system and includes the following
sections:
■ Lustre Failover and Multiple-Mount Protection
Note – For information about high availability (HA) management software, see the
Lustre wiki topic Using Red Hat Cluster Manager with Lustre or the Lustre wiki
topic Using Pacemaker with Lustre.
20.1 Lustre Failover and Multiple-Mount Protection
The failover functionality in Lustre is implemented with the multiple-mount
protection (MMP) feature, which protects the file system from being mounted
simultaneously to more than one node. This feature is important in a shared storage
environment (for example, when a failover pair of OSTs share a partition).
Lustre's backend file system, ldiskfs, supports the MMP mechanism. A block in the
file system is updated by a kmmpd daemon at one-second intervals, and a sequence
number is written in this block. If the file system is cleanly unmounted, then a special
"clean" sequence is written to this block. When mounting the file system, ldiskfs
checks if the MMP block has a clean sequence or not.
Even if the MMP block has a clean sequence, ldiskfs waits for some interval to guard
against the following situations:
■ If I/O traffic is heavy, it may take longer for the MMP block to be updated.
■ If another node is trying to mount the same file system, a "race" condition may
occur.
With MMP enabled, mounting a clean file system takes at least 10 seconds. If the file
system was not cleanly unmounted, then the file system mount may require
additional time.
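The mount-time check described above can be pictured as "read the sequence, wait, read it again". The following is a simplified model only, not the ldiskfs implementation: read_mmp_seq is a hypothetical callable that returns the current sequence value from the MMP block, and the wait interval is supplied by the caller.

```python
import time


def safe_to_mount(read_mmp_seq, wait_seconds, sleep=time.sleep):
    """Model of the MMP check: the sequence must not change while we wait.

    read_mmp_seq is a hypothetical callable returning the current sequence
    value (or the special "clean" marker) stored in the MMP block.
    """
    first = read_mmp_seq()
    # Wait even if the block looks clean: updates may be delayed by heavy I/O,
    # or another node may be racing to mount the same file system.
    sleep(wait_seconds)
    second = read_mmp_seq()
    # An unchanged value means no other node updated the block in the meantime.
    return first == second


if __name__ == "__main__":
    readings = ["clean", "clean"]
    print(safe_to_mount(lambda: readings.pop(0), 0))
```

The sleep parameter exists only so the model can be exercised without actually waiting.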
Note – The MMP feature is only supported on Linux kernel versions >= 2.6.9.
Use the following commands to determine whether MMP is running in Lustre and to
enable or disable the MMP feature.
When MMP is enabled, if ldiskfs detects multiple mount attempts after the file
system is mounted, it blocks these later mount attempts and reports the time when
the MMP block was last updated, the node name, and the device name of the node
where the file system is currently mounted.
This chapter describes how to configure quotas and includes the following sections:
■ Working with Quotas
■ Enabling Disk Quotas
■ Creating Quota Files and Quota Administration
■ Quota Allocation
■ Known Issues with Quotas
■ Lustre Quota Statistics
21.1 Working with Quotas
Quotas allow a system administrator to limit the amount of disk space a user or
group can use in a directory. Quotas are set by root, and can be specified for
individual users and/or groups. Before a file is written to a partition where quotas
are set, the quota of the creator's group is checked. If a quota exists, then the file size
counts towards the group's quota. If no quota exists, then the owner's user quota is
checked before the file is written. Similarly, inode usage for specific functions can be
controlled if a user over-uses the allocated space.
Lustre quota enforcement differs from standard Linux quota enforcement in several
ways:
■ Quotas are administered via the lfs command (post-mount).
■ Quotas are distributed (as Lustre is a distributed file system), which has several
ramifications.
■ Quotas are allocated and consumed in a quantized fashion.
■ Clients do not set the usrquota or grpquota options to mount. When quota is
enabled, it is enabled for all clients of the file system; it is started automatically
using quota_type or started manually with lfs quotaon.
Caution – Although quotas are available in Lustre, root quotas are NOT enforced.
The usage reported by lfs quota -u root includes internal Lustre data that is dynamic
in size and does not accurately reflect the block and inode usage visible at the mount point.
1. If you have recompiled your Linux kernel, be sure that CONFIG_QUOTA and
CONFIG_QUOTACTL are enabled. Also, verify that CONFIG_QFMT_V1
and/or CONFIG_QFMT_V2 are enabled.
Quota is enabled in all Linux 2.6 kernels supplied for Lustre.
3. Mount the Lustre file system on the client and verify that the lquota module has
loaded properly by using the lsmod command.
$ lsmod
[root@oss161 ~]# lsmod
Module Size Used by
obdfilter 220532 1
fsfilt_ldiskfs 52228 1
ost 96712 1
mgc 60384 1
ldiskfs 186896 2 fsfilt_ldiskfs
lustre 401744 0
lov 289064 1 lustre
lquota 107048 4 obdfilter
mdc 95016 1 lustre
ksocklnd 111812 1
The Lustre mount command no longer recognizes the usrquota and grpquota
options. If they were previously specified, remove them from /etc/fstab.
When quota is enabled, it is enabled for all file system clients (started automatically
using quota_type or manually with lfs quotaon).
Note – Lustre with the Linux kernel 2.4 does not support quotas.
To enable quotas automatically when the file system is started, you must set the
mdt.quota_type and ost.quota_type parameters, respectively, on the MDT and
OSTs. The parameters can be set to the string u (user), g (group) or ug for both users
and groups.
Lustre 1.6.5 introduced the v2 file format for administrative quota files, with
continued support for the old file format (v1). The mdt.quota_type parameter also
handles ‘1’ and ‘2’ options, to specify the Lustre quota versions that will be used. For
example:
--param mdt.quota_type=ug1
--param mdt.quota_type=u2
Lustre 1.6.6 introduced the v2 file format for operational quotas, with continued
support for the old file format (v1). The ost.quota_type parameter handles ‘1’ and
‘2’ options, to specify the Lustre quota versions that will be used. For example:
--param ost.quota_type=ug2
--param ost.quota_type=u1
For more information about the v1 and v2 formats, see Section 21.5.3, “Quota File
Formats” on page 21-12.
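The quota_type values shown above combine a type string (u, g or ug) with an optional format version digit. As a sketch for validating such a value in a setup script (parse_quota_type is a hypothetical helper, and it accepts only the forms documented here):

```python
def parse_quota_type(value):
    """Split a quota_type string such as "ug2" into (users, groups, version).

    version is None when no trailing 1 or 2 is given.
    """
    version = None
    if value and value[-1] in ("1", "2"):
        version = int(value[-1])
        value = value[:-1]
    if value not in ("u", "g", "ug"):
        raise ValueError(f"unsupported quota_type: {value!r}")
    return ("u" in value, "g" in value, version)


if __name__ == "__main__":
    for setting in ("ug1", "u2", "ug"):
        print(setting, "->", parse_quota_type(setting))
```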
Caution – When lfs quotacheck is run, Lustre must NOT be performing any
write operations. Otherwise, the quota statistics may be inaccurate. For example,
the number of blocks used by OSTs for users or groups will be inaccurate, which
can cause unexpected quota problems.
Note – User and group quotas are separate. If either quota limit is reached, a process
with the corresponding UID/GID cannot allocate more space on the file system.
Note – When lfs quotacheck runs, it creates a quota file -- a sparse file with a size
proportional to the highest UID in use and UID/GID distribution. As a general rule,
if the highest UID in use is large, then the sparse file will be large, which may affect
functions such as creating a snapshot.
In certain failure situations (e.g., when a broken Lustre installation or build is used),
re-run quotacheck after checking the server kernel logs and fixing the root
problem.
The lfs command includes several command options to work with quotas:
■ quotaon — enables disk quotas on the specified file system. The file system quota
files must be present in the root directory of the file system.
■ quotaoff — disables disk quotas on the specified file system.
■ quota — displays general quota information (disk usage and limits)
■ setquota — specifies quota limits and tunes the grace period. By default, the
grace period is one week.
Usage:
Examples:
To display general quota information for a specific user ("bob" in this example), run:
To display general quota information for a specific user ("bob" in this example) and
detailed quota statistics for each MDT and OST, run:
To display general quota information for a specific group ("eng" in this example),
run:
To display block and inode grace times for user quotas, run:
To set user and group quotas for a specific user ("bob" in this example), run:
Note – For the lfs setquota and lfs quota commands, the qunit for blocks is 1 KB
(1024 bytes) and the qunit for inodes is 1.
The quota command displays the quota allocated and consumed for each Lustre
device. Using the previous setquota example, running this lfs quota command:
The quota system in Lustre is completely compatible with the quota systems used on
other file systems. The Lustre quota system distributes quotas from the quota master.
Generally, the MDS is the quota master for both inodes and blocks. All OSTs and the
MDS are quota slaves to the OSS nodes. To reduce quota requests and get reasonably
accurate quota distribution, the transfer quota unit (qunit) between quota master and
quota slaves is changed dynamically by the lquota module. The default minimum
value of qunit is 1 MB for blocks and 2 for inodes. The proc entries to set these values
are: /proc/fs/lustre/mds/lustre-MDT*/quota_least_bunit and
/proc/fs/lustre/mds/lustre-MDT*/quota_least_iunit. The default
maximum value of qunit is 128 MB for blocks and 5120 for inodes. The proc entries to
set these values are quota_bunit_sz and quota_iunit_sz in the MDT and OSTs.
Note – In general, the quota_bunit_sz value should be larger than 1 MB. For
testing purposes, it can be set to 4 KB, if necessary.
The file system block quota is divided up among the OSTs and the MDS within the
file system. Only the MDS uses the file system inode quota.
This means that the minimum quota for blocks is 1 MB * (the number of OSTs + the
number of MDSs), which is 1 MB * (number of OSTs + 1). If you attempt to assign a
smaller quota, users may not be able to create files. As noted, the default minimum
quota for inodes is 2. The default is established at file system creation time, but can
be tuned via /proc values (described below). The inode quota is also allocated in a
quantized manner on the MDS.
If we look at the setquota example again, running this lfs quota command:
Note – Values appended with “*” show a limit that has been exceeded. Operations that
exceed the quota fail with the message Disk quota exceeded. For example:
$ cp: writing `/mnt/lustre/var/cache/fontconfig/beeeeb3dfe132a8a0633a017c99ce0-x86.cache’: Disk quota exceeded.
Note – The block quota is consumed separately on each OST and on the MDS, and the
inode quota is consumed on the MDS (there is only one MDS for inodes). Therefore, when the
quota is consumed on one OST, the client may not be able to create files regardless of
the quota available on other OSTs.
Additional information:
Grace period — The period of time (in seconds) within which users are allowed to
exceed their soft limit. There are four types of grace periods:
■ user block soft limit
■ user inode soft limit
■ group block soft limit
■ group inode soft limit
The grace periods are applied to all users. The user block soft limit is for all users
who are using a blocks quota.
Soft limit — Once you exceed the soft limit, the quota module starts the grace
timer, but you can still write blocks and inodes. If you remain beyond the soft limit
until the grace time is used up, the result is the same as reaching the hard limit. This
applies to both inodes and blocks. The soft limit MUST be less than the hard limit; if not,
the quota module never triggers the timer. If the soft limit is not needed, leave it as
zero (0).
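The interaction of soft limit, hard limit and grace period can be summarized in a few lines. This is a simplified model of the rules described above, not the lquota implementation; write_allowed is a hypothetical function, a limit of 0 is taken to mean "not set", and the one-week value is the default grace period mentioned earlier.

```python
ONE_WEEK = 7 * 24 * 3600  # default grace period, in seconds


def write_allowed(usage, soft, hard, seconds_over_soft, grace=ONE_WEEK):
    """Return True if further allocation is permitted under this simple model."""
    if hard and usage >= hard:
        return False  # hard limit reached
    if soft and usage > soft and seconds_over_soft >= grace:
        return False  # grace time used up: same result as the hard limit
    return True


if __name__ == "__main__":
    print(write_allowed(usage=150, soft=100, hard=200, seconds_over_soft=3600))
```

The same check applies to blocks and to inodes; only the units of usage, soft and hard differ.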
The /proc values are bounded by two other variables quota_btune_sz and
quota_itune_sz. By default, the *tune_sz variables are set at 1/2 the *unit_sz
variables, and you cannot set *tune_sz larger than *unit_sz. You must set
bunit_sz first if it is increasing by more than 2x, and btune_sz first if it is
decreasing by more than 2x.
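The ordering rule follows from the constraint that *tune_sz may never exceed *unit_sz. A small sketch that derives the order from that constraint, assuming *tune_sz is kept at half of *unit_sz before and after the change (update_order is a hypothetical helper, not a Lustre tool):

```python
def update_order(old_unit, new_unit):
    """Return which value to write first so that tune_sz <= unit_sz throughout.

    Assumes tune_sz is half of unit_sz both before and after the change.
    """
    old_tune, new_tune = old_unit / 2, new_unit / 2
    if new_tune > old_unit:
        # Growing by more than 2x: the new tune_sz would exceed the old unit_sz.
        return "unit_first"
    if old_tune > new_unit:
        # Shrinking by more than 2x: the new unit_sz would drop below the old tune_sz.
        return "tune_first"
    return "either"


if __name__ == "__main__":
    print(update_order(128, 512))
```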
Total number of inodes — To determine the total number of inodes, use lfs df -i
(and also /proc/fs/lustre/*/*/filestotal). For more information on using
the lfs df -i command and the command output, see Section 18.5.1, “Checking File
System Free Space” on page 18-9.
Unfortunately, the statfs interface does not report the free inode count directly, but
instead reports the total inode and used inode counts. The free inode count is
calculated for df from (total inodes - used inodes).
It is not critical to know a file system’s total inode count. Instead, you should know
(accurately), the free inode count and the used inode count for a file system. Lustre
manipulates the total inode count in order to accurately report the other two values.
The values set for the MDS must match the values set on the OSTs.
The quota_bunit_sz parameter is expressed in bytes, whereas lfs setquota uses KB.
The quota_bunit_sz parameter must be a multiple of 1024. A proper minimum KB
size for lfs setquota can be calculated as:
Size in KBs = minimum_quota_bunit_sz * (number of OSTS + 1) = 1024 * (number of OSTs +1)
We add one (1) to the number of OSTs as the MDS also consumes KBs. As inodes are
only consumed on the MDS, the minimum inode size for lfs setquota is equal to
quota_iunit_sz.
Note – Setting the quota below this limit may prevent the user from creating any files.
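As a worked instance of the formula above, the following sketch computes the minimum block limit in KB. The function name is illustrative, and the six-OST figure is only an example.

```python
def min_block_quota_kb(num_osts, min_bunit_kb=1024):
    """Minimum block quota in KB: one minimum qunit per OST plus one for the MDS."""
    return min_bunit_kb * (num_osts + 1)


if __name__ == "__main__":
    # Example: a file system with six OSTs and the default 1 MB minimum qunit.
    print(min_block_quota_kb(6), "KB")
```

With six OSTs, this gives 1024 * 7 = 7168 KB as the smallest block limit to pass to lfs setquota.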
2. If the Lustre client has enough granted cache, then it returns ‘success’ to users
and arranges the writes to the OSTs.
3. Because Lustre clients have delivered success to users, the OSTs cannot fail
these writes.
Because of granted cache, writes can always exceed quota limits. For example, if
you set a 400 GB quota on user A and use IOR to write for user A from a bundle of
clients, you will write much more data than 400 GB, and cause an out-of-quota error
(-EDQUOT).
Note – The effect of granted cache on quota limits can be mitigated, but not
eradicated. Reduce the max_dirty_buffer in the clients (can be set from 0 to 512).
To set max_dirty_buffer to 0:
Lustre Version Quota Limit Per User/Per Group OST Storage Limit
Lustre 1.6.5 and later use mdt.quota_type to force a specific administrative quota
version (v2 or v1).
■ For the v2 quota file format, (OBJECTS/admin_quotafile_v2.{usr,grp})
■ For the v1 quota file format, (OBJECTS/admin_quotafile.{usr,grp})
Lustre 1.6.6 and later use ost.quota_type to force a specific operational quota
version (v2 or v1).
■ For the v2 quota file format, (lquota_v2.{user,group})
■ For the v1 quota file format, (lquota.{user,group})
If quotas do not exist or look broken, then quotacheck creates quota files of a
required name and format.
If Lustre is using the v2 quota file format when only v1 quota files exist, then
quotacheck converts old v1 quota files to new v2 quota files. This conversion is
triggered automatically, and is transparent to users. If an old quota file does not exist
or looks broken, then the new v2 quota file will be empty. In case of an error, details
can be found in the kernel log of the corresponding MDS/OST. During conversion of
a v1 quota file to a v2 quota file, the v2 quota file is marked as broken, to avoid it
being used if a crash occurs. The quota module does not use broken quota files
(keeping quota off).
Each quota statistic consists of a quota event and min_time, max_time and
sum_time values for the event.
cat /proc/fs/lustre/lquota/lustre-OST0000/stats
In the first line, snapshot_time indicates when the statistics were taken. The
remaining lines list the quota events and their associated data.
In the second line, the async_acq_req event occurs one time. The min_time,
max_time and sum_time statistics for this event are 32, 32 and 32, respectively. The
unit is microseconds (µs).
In the fifth line, the quota_ctl event occurs four times. The min_time, max_time
and sum_time statistics for this event are 80, 3470 and 4293, respectively. The unit is
microseconds (µs).
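Because each event reports a count and a sum_time, the mean time per event can be derived directly. A minimal sketch using the figures quoted above (the function name is illustrative):

```python
def average_time_us(count, sum_time_us):
    """Mean time per event in microseconds; 0.0 when the event never occurred."""
    if count == 0:
        return 0.0
    return sum_time_us / count


if __name__ == "__main__":
    # quota_ctl: four events with a sum_time of 4293 microseconds
    print(average_time_us(4, 4293))
```

For quota_ctl this is 4293 / 4 = 1073.25 µs, well above the 80 µs minimum, which shows that the 3470 µs maximum dominates the total.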
This chapter describes Lustre security and includes the following sections:
■ Using ACLs
■ Using Root Squash
22.1 Using ACLs
An access control list (ACL) is a set of data that informs an operating system about
permissions or access rights that each user or group has to specific system objects,
such as directories or files. Each object has a unique security attribute that identifies
users who have access to it. The ACL lists each object and user access privileges such
as read, write or execute.
http://www.suse.de/~agruen/acl/linux-acls/online/
We have implemented ACLs according to this model. Lustre works with the standard
Linux ACL tools, setfacl, getfacl, and the historical chacl, normally installed with the
ACL package.
Note – ACL support is a system-wide feature, meaning that either all clients have ACLs
enabled or none do. You cannot specify which clients should enable ACLs.
The ls -l command displays the owner, group, and other class permissions in the
first column of its output (for example, -rw-r----- for a regular file with read and
write access for the owner class, read access for the group class, and no access for
others).
Minimal ACLs have three entries. Extended ACLs have more than the three entries.
Extended ACLs also contain a mask entry and may contain any number of named
user and named group entries.
Alternately, you can enable ACLs at run time by using the --acl option with
mkfs.lustre:
ACLs are enabled in Lustre on a system-wide basis; either all clients enable ACLs or
none do. Activating ACLs is controlled by the MDS mount options acl/noacl
(enable/disable ACLs). Client-side mount options acl/noacl are ignored. You do
not need to change the client configuration, and the “acl” string will not appear in
the client /etc/mtab. The client acl mount option is no longer needed. If a client is
mounted with that option, then this message appears in the MDS syslog:
If ACLs are not enabled on the MDS, then any attempts to reference an ACL on a
client return an Operation not supported error.
22.1.3 Examples
These examples are taken directly from the POSIX paper referenced above. ACLs on
a Lustre file system work exactly like ACLs on any Linux file system. They are
manipulated with the standard tools in the standard manner. Below, we create a
directory and allow a specific user access.
The root squash feature works by re-mapping the user ID (UID) and group ID (GID)
of the root user to a UID and GID specified by the system administrator, via the
Lustre configuration management server (MGS). The root squash feature also enables
the Lustre administrator to specify a set of clients for which UID/GID re-mapping
does not apply.
nosquash_nids=172.16.245.[0-255/2]@tcp
In this example, root squash does not apply to TCP clients on subnet 172.16.245.0
that have an even number as the last component of their IP address.
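To check which clients a nosquash_nids range covers, the bracket notation can be expanded and compared against a client NID. The sketch below is a simplified model for IPv4-style NIDs only; it is not Lustre's own NID parser, and nid_matches is a hypothetical helper.

```python
def _expand(expr):
    """Expand "[0-255/2]", "[10,11]" or a plain number into a set of ints."""
    if not expr.startswith("["):
        return {int(expr)}
    values = set()
    for part in expr[1:-1].split(","):
        span, _, step = part.partition("/")
        low, _, high = span.partition("-")
        # A part without "-" is a single value; a missing step defaults to 1.
        values.update(range(int(low), int(high or low) + 1, int(step or 1)))
    return values


def nid_matches(nid, pattern):
    """Return True if an address@network NID falls inside the given range pattern."""
    addr, _, net = nid.partition("@")
    p_addr, _, p_net = pattern.partition("@")
    if net != p_net:
        return False
    octets, p_octets = addr.split("."), p_addr.split(".")
    if len(octets) != len(p_octets):
        return False
    return all(int(o) in _expand(p) for o, p in zip(octets, p_octets))


if __name__ == "__main__":
    pattern = "172.16.245.[0-255/2]@tcp"
    for nid in ("172.16.245.4@tcp", "172.16.245.5@tcp"):
        print(nid, nid_matches(nid, pattern))
```

Under this model, 172.16.245.4@tcp is exempt from root squash and 172.16.245.5@tcp is not.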
Root squash parameters can be set when the MDT is created (mkfs.lustre --mdt).
For example:
mkfs.lustre --reformat --fsname=Lustre --mdt --mgs \
--param "mds.root_squash=500:501" \
--param "mds.nosquash_nids='0@elan1 192.168.1.[10,11]'" /dev/sda1
Root squash parameters can also be changed with the lctl conf_param command.
For example:
- OR -
If the nosquash_nids value consists of several NID ranges (e.g. 0@elan, 1@elan1),
the list of NID ranges must be quoted with single (') or double (") quotation marks.
List elements must be separated with a space. For example:
mkfs.lustre ... --param "mds.nosquash_nids='0@elan1 1@elan2'" /dev/sda1
lctl conf_param Lustre.mds.nosquash_nids="24@elan 15@elan1"
Part IV describes tools and procedures used to tune a Lustre file system for optimum
performance. You will find information in this section about:
Lustre Tuning
CHAPTER 23
This chapter describes the LNET self-test, which is used by site administrators to
confirm that Lustre Networking (LNET) has been properly installed and configured,
and that underlying network software and hardware are performing according to
expectations. The chapter includes:
23.1 LNET Self-Test Overview
LNET self-test is a kernel module that runs over LNET and the Lustre network
drivers (LNDs). It is designed to:
■ Test the connection ability of the Lustre network
■ Run regression tests of the Lustre network
■ Test performance of the Lustre network
After you have obtained performance results for your Lustre network, refer to
Chapter 25: Lustre Tuning for information about parameters that can be used to tune
LNET for optimum performance.
Note – Apart from the performance impact, LNET self-test is invisible to Lustre.
Note – Test nodes can be in either kernel or userspace. A console user can invite a
kernel test node to join the test session by running lst add_group NID, but the
console user cannot actively add a userspace test node to the test-session. However,
the console user can passively accept a test node to the test session while the test
node is running lstclient to connect to the console.
modprobe lnet_selftest
This command recursively loads the modules on which LNET self-test depends.
Note – While the console and test nodes require all the prerequisite modules to be
loaded, userspace test nodes do not require these modules.
Almost all operations should be performed within the context of a session. From the
console node, a user can only operate nodes in his own session. If a session ends, the
session context in all test nodes is stopped.
export LST_SESSION=$$
lst new_session read_write
Each node in a group has a rank, determined by the order in which it was added to
the group. The rank is used to establish test traffic patterns.
A user can only control nodes in his/her session. To allocate nodes to the session, the
user needs to add nodes to a group (of the session). All nodes in a group can be
referenced by the group name. A node can be allocated to multiple groups of a
session.
Note – A console user can associate kernel space test nodes with the session by
running lst add_group NIDs, but a userspace test node cannot be actively added
to the session. However, the console user can passively "accept" a test node to
associate with a test session while the test node running lstclient connects to the
console node (i.e., lstclient --sesid CONSOLE_NID --group NAME).
A batch is a collection of tests that are started and stopped together and run in
parallel. A test must always be run as part of a batch, even if it is just a single test.
Users can only run or stop a test batch, not individual tests.
Tests in a batch are non-destructive to the file system, and can be run in a normal
Lustre environment (provided the performance impact is acceptable).
A simple batch might contain a single test, for example, to determine whether the
network bandwidth presents an I/O bottleneck. In this example, the --to <group>
could be comprised of Lustre OSSs and --from <group> the compute nodes. A
second test could be added to perform pings from a login node to the MDS to see
how checkpointing affects the ls -l process.
In the example below, a batch is created called bulk_rw. Then two brw tests are
added. In the first test, 1M of data is sent from the servers to the clients as a
simulated read operation with a simple data validation check. In the second test, 4K
of data is sent from the clients to the servers as a simulated write operation with a
full data validation check.
The traffic pattern and test intensity are determined by several properties such as test
type, distribution of test nodes, concurrency of test, and RDMA operation type. For
more details see Section 23.3.3, “Batch and Test Commands” on page 23-11.
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 192.168.10.[8,10,12-16]@tcp
lst add_group readers 192.168.1.[1-253/2]@o2ib
lst add_group writers 192.168.1.[2-254/2]@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
Note – This script can be easily adapted to pass the group NIDs by shell variables or
command line arguments (making it good for general-purpose use).
LST_SESSION
The lst utility uses the LST_SESSION environment variable to identify the session
locally on the self-test console node. This should be a numeric value that uniquely
identifies all session processes on the node. It is convenient to set this to the process
ID of the shell both for interactive use and in shell scripts. Almost all lst commands
require LST_SESSION to be set.
Example:
export LST_SESSION=$$
Parameter Description
--timeout <seconds> Console timeout value of the session. The session ends
automatically if it remains idle (i.e., no commands are issued)
for this period.
--force Ends conflicting sessions. This determines who “wins” when
one session conflicts with another. For example, if there is
already an active session on this node, then the attempt to create
a new session fails unless the --force flag is specified. If the
--force flag is specified, then the active session is ended.
Similarly, if a session attempts to add a node that is already
“owned” by another session, the --force flag allows this session
to “steal” the node.
<name> A human-readable string to print when listing sessions or
reporting session conflicts.
Example:
$ lst new_session --force read_write
Stops all operations and tests in the current session and clears the session’s status.
$ lst end_session
show_session
Shows the session information. This command prints information about the current
session. It does not require LST_SESSION to be defined in the process environment.
$ lst show_session
Creates the group and adds a list of test nodes to the group.
Parameter Description
<name> Name of the group.
<NIDs> A string that may be expanded to include one or more LNET NIDs.
Example:
Parameter Description
--refresh Refreshes the state of all inactive nodes in the group.
--clean <status> Removes nodes with a specified status from the group. Status may
be:
active The node is in the current session.
busy The node is now owned by another session.
down The node has been marked down.
unknown The node’s status has yet to be determined.
invalid Any state but active.
--remove <NIDs> Removes specified nodes from the group.
Example:
Prints information about a group or lists all groups in the current session if no group
is specified.
Parameter Description
<name> The name of the group.
--active Lists the active nodes.
--busy Lists the busy nodes.
--down Lists the down nodes.
--unknown Lists unknown nodes.
--all Lists all nodes.
$ lst list_group
1) clients
2) servers
Total 2 groups
$ lst list_group clients
ACTIVE BUSY DOWN UNKNOWN TOTAL
3 1 2 0 6
$ lst list_group clients --all
192.168.1.10@tcp Active
192.168.1.11@tcp Active
192.168.1.12@tcp Busy
192.168.1.13@tcp Active
192.168.1.14@tcp DOWN
192.168.1.15@tcp DOWN
Total 6 nodes
$ lst list_group clients --busy
192.168.1.12@tcp Busy
Total 1 node
del_group <name>
Removes a group from the session. If the group is referred to by any test, then the
operation fails. If nodes in the group are referred to only by this group, then they are
kicked out from the current session; otherwise, they are still in the current session.
Use lstclient to run the userland self-test client. The lstclient command
should be executed after creating a session on the console. There are only two
mandatory options for lstclient:
Parameter Description
--sesid <NID> The first console’s NID.
--group <name> The test group to join.
--server_mode When included, forces LNET to behave as a server, such as starting an
acceptor if the underlying NID needs it or using privileged ports. Only
root is allowed to use the --server_mode option.
Example:
add_batch NAME
A default batch test set named batch is created when the session is started. You can
specify a batch name by using add_batch:
Parameter Description
--batch <batchname> Names a group of tests for later execution.
--loop <#> Number of times to run the test.
--concurrency <#> The number of requests that are active at one time.
--distribute <#:#> Determines the ratio of client nodes to server nodes for the
specified test. This allows you to specify a wide range of
topologies, including one-to-one and all-to-all. Distribution
divides the source group into subsets, which are paired with
equivalent subsets from the target group so only nodes in
matching subsets communicate.
--from <group> The source group (test client).
--to <group> The target group (test server).
The setting --distribute 1:1 is the default setting where each source node
communicates with one target node.
When the setting --distribute 1:<n> (where <n> is the size of the target group)
is used, each source node communicates with every node in the target group.
Note that if there are more source nodes than target nodes, some source nodes may
share the same target nodes. Also, if there are more target nodes than source nodes,
some higher-ranked target nodes will be idle.
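The pairing behavior described above can be illustrated with a small model. This is one plausible reading of the description, not the algorithm lst actually uses: sources are grouped into subsets by rank, and each subset is paired with consecutive targets, wrapping around when the targets run out. The function distribute and the node names are hypothetical.

```python
def distribute(sources, targets, ratio):
    """Map each source node to its target nodes for a ratio such as "1:1" or "2:1"."""
    a, b = (int(n) for n in ratio.split(":"))
    pairs = {}
    for rank, src in enumerate(sources):
        subset = rank // a  # index of the source subset this node belongs to
        # Each subset talks to b consecutive targets, wrapping around by rank.
        pairs[src] = [targets[(subset * b + k) % len(targets)] for k in range(b)]
    return pairs


if __name__ == "__main__":
    clients = ["C1", "C2", "C3", "C4"]
    servers = ["S1", "S2", "S3"]
    for src, dst in distribute(clients, servers, "1:1").items():
        print(src, "->", dst)
```

In this model, with four sources and three targets at 1:1, C4 wraps around and shares S1 with C1. With two sources and three targets, S3 is never selected.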
Lists batches in the current session or lists client and server nodes in a batch or a test.
Parameter Description
--test <index> Lists tests in a batch. If no option is used, all tests in the batch are
listed. If one of these options is used, only specified tests in the
batch are listed:
active Lists only active batch tests.
invalid Lists only invalid batch tests.
server | client Lists client and server nodes in a batch test.
Example:
$ lst list_batch
bulkperf
$ lst list_batch bulkperf
Batch: bulkperf Tests: 1 State: Idle
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 8 0 0 0 8
server 4 0 0 0 4
Test 1(brw) (loop: 100, concurrency: 4)
ACTIVE BUSY DOWN UNKNOWN TOTAL
client 8 0 0 0 8
server 4 0 0 0 4
$ lst list_batch bulkperf --server --active
192.168.10.100@tcp Active
192.168.10.101@tcp Active
192.168.10.102@tcp Active
192.168.10.103@tcp Active
stop <name>
query <name> [--test <index>] [--timeout <seconds>] [--loop <#>] [--delay <seconds>]
[--all]
Parameter Description
--test <index> Only queries the specified test. The test index starts from 1.
--timeout <seconds> The timeout value to wait for RPC. The default is 5 seconds.
--loop <#> The loop count of the query.
--delay <seconds> The interval of each query. The default is 5 seconds.
--all Lists the status of all nodes in a batch or a test.
Example:
Parameter Description
--session Pings all nodes in the current session.
--group <name> Pings all nodes in a specified group.
--nodes <NIDs> Pings all specified nodes.
--batch <name> Pings all client nodes in a batch.
--server Sends RPC to all server nodes instead of client nodes. This
option is only used with --batch <name>.
--timeout <seconds> The RPC timeout value.
Example:
Parameter Description
--bw Displays the bandwidth of the specified group/nodes.
--rate Displays the rate of RPCs of the specified group/nodes.
--read Displays the read statistics of the specified group/nodes.
--write Displays the write statistics of the specified group/nodes.
--max Displays the maximum value of the statistics.
--min Displays the minimum value of the statistics.
--avg Displays the average of the statistics.
--timeout <seconds> The timeout of the statistics RPC. The default is 5 seconds.
--delay <seconds> The interval of the statistics (in seconds).
Example:
Specifying a group name (<group>) causes statistics to be gathered for all nodes in a
test group. For example:
Specifying a NID range (<NIDs>) causes statistics to be gathered for selected nodes.
For example:
Parameter Description
--session Lists errors in the current test session. With this option, historical RPC errors are
not listed.
Example:
This chapter describes the Lustre I/O kit, a collection of I/O benchmarking tools for
a Lustre cluster, and PIOS, a parallel I/O simulator for Linux and Solaris. It includes:
■ Using Lustre I/O Kit Tools
■ Testing I/O Performance of Raw Hardware (sgpdd_survey)
■ Testing OST Performance (obdfilter_survey)
■ Testing OST I/O Performance (ost_survey)
■ Collecting Application Profiling Information (stats-collect)
24.1 Using Lustre I/O Kit Tools
The tools in the Lustre I/O Kit are used to benchmark Lustre hardware and validate
that it is working as expected before you install the Lustre software. They can also be
used to validate the performance of the various hardware and software layers in
the cluster and to find and troubleshoot I/O issues.
Typically, performance is measured starting with single raw devices and then
proceeding to groups of devices. Once raw performance has been established, other
software layers are then added incrementally and tested.
Typically with these tests, Lustre should deliver 85-90% of the raw device
performance.
http://downloads.lustre.org/public/tools/lustre-iokit/
The script uses sgp_dd to carry out raw sequential disk I/O. It runs with variable
numbers of sgp_dd threads to show how performance varies with different request
queue depths.
The script spawns variable numbers of sgp_dd instances, each reading or writing a
separate area of the disk to demonstrate performance variance within a number of
concurrent stripe files.
Several tips and insights for disk performance measurement are described below.
Some of this information is specific to RAID arrays and/or the Linux RAID
implementation.
■ Performance is limited by the slowest disk.
Before creating a RAID array, benchmark all disks individually. We have
frequently encountered situations where drive performance was not consistent for
all devices in the array. Replace any disks that are significantly slower than the
rest.
Caution – The sgpdd_survey script overwrites the device being tested, which
results in the LOSS OF ALL DATA on that device. Exercise caution when selecting the
device to be tested.
Note – Array performance with all LUNs loaded does not always match the
performance of a single LUN when tested in isolation.
Prerequisites:
■ sgp_dd tool in the sg3_utils package
■ Lustre software is NOT required
The device(s) being tested must meet one of these two requirements:
■ If the device is a SCSI device, it must appear in the output of sg_map (make sure
the kernel module sg is loaded).
■ If the device is a raw device, it must appear in the output of raw -qa.
Note – If you need to create raw devices to use the sgpdd_survey tool, note that
raw device 0 cannot be used due to a bug in certain versions of the "raw" utility
(including that shipped with RHEL4U4.)
/sys/block/sdN/queue/max_sectors_kb = 4096
/sys/block/sdN/queue/max_phys_segments = 256
/proc/scsi/sg/allow_dio = 1
/sys/module/ib_srp/parameters/srp_sg_tablesize = 255
/sys/block/sdN/queue/scheduler
When the sgpdd_survey script runs, it creates a number of working files and a pair
of result files. The names of all the files created start with the prefix defined in the
variable ${rslt}. (The default value is /tmp.) The files include:
■ File containing standard output data (same as stdout)
${rslt}_<date/time>.summary
■ Temporary (tmp) files
${rslt}_<date/time>_*
■ Collected tmp files for post-mortem
${rslt}_<date/time>.detail
The stdout and the .summary file will contain lines like this:
Each line corresponds to a run of the test. Each test run will have a different number
of threads, record size, or number of regions.
■ total_size - Size of file being tested in KBs (8 GB in above example).
■ rsz - Record size in KBs (1 MB in above example).
■ thr - Number of threads generating I/O (1 thread in above example).
■ crg - Current regions, the number of disjoint areas on the disk to which I/O is
being sent (1 region in above example, indicating that no seeking is done).
■ MB/s - Aggregate bandwidth measured by dividing the total amount of data by
the elapsed time (180.45 MB/s in the above example).
■ MB/s - The remaining numbers show the number of regions X performance of the
slowest disk as a sanity check on the aggregate bandwidth.
If there are so many threads that the sgp_dd script is unlikely to be able to allocate
I/O buffers, then ENOMEM is printed in place of the aggregate bandwidth result.
The obdfilter_survey script can be run directly on the OSS node to measure the
OST storage performance without any intervening network, or it can be run remotely
on a Lustre client to measure the OST performance including network overhead.
Note – The obdfilter_survey script is NOT scalable beyond tens of OSTs since it
is only intended to measure the I/O performance of individual storage subsystems,
not the scalability of the entire system.
modprobe obdecho
To run the network test, a specific Lustre setup is needed. Make sure that these
configuration requirements have been met.
modprobe obdecho
3. Start lctl and check the device list, which must be empty. Run:
lctl dl
/proc/fs/lustre/obdecho/<echo_srv>/stats
where <echo_srv> is the obdecho server created by the script.
modprobe obdecho
File Description
The obdfilter_survey script iterates over the given number of threads and objects
performing the specified tests and checks that all test processes have completed
successfully.
ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]
Where:
Parameter and
value Description
Note – Although the numbers of threads and objects are specified per-OST in the
customization section of the script, the reported results are aggregated over all OSTs.
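When comparing many runs, it can help to extract the numbers from each result line programmatically. The following Python sketch parses a line in the format shown above. It handles only a single test result per line, and the field meanings are assumptions: the value after the test name is taken to be the aggregate bandwidth and the bracketed pair the lowest and highest per-OST values. The names SURVEY_LINE and parse_survey_line are hypothetical.

```python
import re

# Pattern for one summary line, for example:
# ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]
SURVEY_LINE = re.compile(
    r"ost\s+(?P<ost>\d+)\s+sz\s+(?P<sz_kb>\d+)K\s+rsz\s+(?P<rsz>\d+)\s+"
    r"obj\s+(?P<obj>\d+)\s+thr\s+(?P<thr>\d+)\s+"
    r"(?P<test>\w+)\s+(?P<aggregate>[\d.]+)\s+"
    r"\[\s*(?P<low>[\d.]+),\s*(?P<high>[\d.]+)\s*\]"
)


def parse_survey_line(line):
    """Return the fields of one result line as a dict, or None if it does not match."""
    match = SURVEY_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("ost", "sz_kb", "rsz", "obj", "thr"):
        fields[key] = int(fields[key])
    for key in ("aggregate", "low", "high"):
        fields[key] = float(fields[key])
    return fields


sample = "ost 8 sz 67108864K rsz 1024 obj 8 thr 8 write 613.54 [ 64.00, 82.00]"
result = parse_survey_line(sample)
```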
It is also useful to monitor and record average disk I/O sizes during each test using
the “disk io size” histogram in the file /proc/fs/lustre/obdfilter/
*/brw_stats (see Section 31.2.5, “Watching the OST Block I/O Stream” on
page 31-17 for details). These numbers help identify problems in the system when
full-sized I/Os are not submitted to the underlying disk. This may be caused by
problems in the device driver or Linux block layer.
Note – We have frequently discovered wide performance variations across all LUNs
in a cluster. This may be caused by faulty disks, RAID parity reconstruction during
the test, or faulty network hardware.
To run the ost_survey script, supply a file size (in KB) and the Lustre mount point.
For example, run:
$ ./ost-survey.sh 10 /mnt/lustre
<statistic>_INTERVAL=[0|n]
3. Stop collecting statistics on each node, clean up the temporary file, and create a
profiling tarball.
Enter:
4. Analyze the collected statistics and create a csv tarball for the specified
profiling data.
Lustre Tuning
This chapter contains information about tuning Lustre for better performance and
includes the following sections:
■ Optimizing the Number of Service Threads
■ Tuning LNET Parameters
■ Lockless I/O Tunables
■ Improving Lustre Performance When Working with Small Files
■ Understanding Why Write Performance Is Better Than Read Performance
Note – Many options in Lustre are set by means of kernel module parameters. These
parameters are contained in the modprobe.conf file.
25.1 Optimizing the Number of Service Threads
An OSS can have a minimum of 2 service threads and a maximum of 512 service
threads. The number of service threads is a function of how much RAM and how
many CPUs are on each OSS node (1 thread / 128MB * num_cpus). If the load on the
OSS node is high, new service threads will be started in order to process more
requests concurrently, up to 4x the initial number of threads (subject to the maximum
of 512). For a 2GB 2-CPU system, the default thread count is 32 and the maximum
thread count is 128.
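The arithmetic above can be checked with a short Python sketch. It is a worked example of the stated formula only, not the kernel code; the function name oss_thread_counts is hypothetical, and clamping the initial value to the stated minimum and maximum is an assumption.

```python
MIN_THREADS = 2
MAX_THREADS = 512


def oss_thread_counts(ram_mb, num_cpus):
    """Return (initial, maximum) OSS service thread counts for a node."""
    # 1 thread per 128 MB of RAM, multiplied by the number of CPUs.
    initial = (ram_mb // 128) * num_cpus
    # Assumption: the initial count is clamped to the documented limits.
    initial = max(MIN_THREADS, min(initial, MAX_THREADS))
    # Under load, threads may grow to 4x the initial count, capped at 512.
    maximum = min(4 * initial, MAX_THREADS)
    return initial, maximum


initial, maximum = oss_thread_counts(ram_mb=2048, num_cpus=2)
print(initial, maximum)
```

For the 2 GB, 2-CPU system in the text, this gives 32 initial threads and a maximum of 128.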
Increasing the number of I/O threads allows the kernel and storage to aggregate
many writes together for more efficient disk I/O. The OSS thread pool is
shared—each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB)
for internal I/O buffers.
Determining the optimum number of OST threads is a process of trial and error, and
varies for each particular configuration. Variables include the number of OSTs on
each OSS, number and speed of disks, RAID configuration, and available RAM. You
may want to start with a number of OST threads equal to the number of actual disk
spindles on the node. If you use RAID, subtract any dead spindles not used for
actual data (e.g., 1 of N spindles for RAID5, 2 of N spindles for RAID6), and
monitor the performance of clients during usual workloads. If performance is
degraded, increase the thread count and see how that works until performance is
degraded again or you reach satisfactory performance.
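The suggested starting point can be computed as follows. This Python sketch only counts data spindles according to the rule of thumb above; the function name starting_ost_threads and the example arrays are hypothetical, and the result is a first guess to be refined by testing.

```python
# Spindles per array that hold parity rather than data (from the text above).
PARITY_SPINDLES = {"raid5": 1, "raid6": 2}


def starting_ost_threads(arrays):
    """Sum the data spindles over (raid_level, spindle_count) pairs."""
    total = 0
    for raid_level, spindles in arrays:
        parity = PARITY_SPINDLES.get(raid_level.lower(), 0)
        total += max(spindles - parity, 0)
    return total


# Hypothetical node: two 10-disk RAID6 arrays and one 5-disk RAID5 array.
start = starting_ost_threads([("raid6", 10), ("raid6", 10), ("raid5", 5)])
print(start)
```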
After startup, the minimum and maximum number of OSS service threads can be set
via the {service}.thread_{min,max,started} tunable. To change the tunable
at runtime, run:
For details, see Section 31.2.12, “Setting MDS and OSS Thread Counts” on page 31-27.
After startup, the minimum and maximum number of MDS service threads can be set
via the {service}.thread_{min,max,started} tunable. To change the tunable
at runtime, run:
For details, see Section 31.2.12, “Setting MDS and OSS Thread Counts” on page 31-27.
At this time, no testing has been done to determine the optimal number of MDS
threads. The default value varies, based on server size, up to a maximum of 32. The
maximum number of threads (MDS_MAX_THREADS) is 512.
Note – The OSS and MDS automatically start new service threads dynamically, in
response to server load within a factor of 4. The default value is calculated the same
way as before. Setting the *_num_threads module parameter disables automatic
thread creation behavior.
ksocklnd has separate parameters for the transmit and receive buffers.
If these parameters are left at the default value (0), the system automatically tunes
the transmit and receive buffer size. In almost every case, this default produces the
best performance. Do not attempt to tune these parameters unless you are a network
expert.
In other cases, if you have an SMP platform with a single fast interface such as 10Gb
Ethernet and more than two CPUs, you may see performance improve by turning
this parameter off.
By default, this parameter is off. As always, you should test the performance to
compare the impact of changing this parameter.
/proc/fs/lustre/ldlm/namespaces/filter-lustre-*
contended_locks - If the number of lock conflicts found in the scan of the granted and
waiting queues exceeds contended_locks, the resource is considered to be
contended.
contention_seconds - The resource remains in the contended state for the time
specified in this parameter.
max_nolock_bytes - Server-side locking is used only for requests smaller than the
size set in the max_nolock_bytes parameter. If this tunable is set to zero (0), it
disables server-side locking for read/write requests.
■ Client-side:
/proc/fs/lustre/llite/lustre-*
contention_seconds - llite inode remembers its contended state for the time
specified in this parameter.
■ Client-side statistics:
The /proc/fs/lustre/llite/lustre-*/stats file has new rows for lockless
I/O statistics.
lockless_read_bytes and lockless_write_bytes - Count the total bytes read or
written without the client acquiring locks. The client makes its own decision based
on the request size: if the request is smaller than min_nolock_size, the client does
not communicate with the server to acquire a lock.
In the case of read operations, the reads from clients may arrive in a different order
and require a lot of seeking to be read from the disk. This noticeably hampers the read
throughput.
For file systems that use socklnd (TCP, Ethernet) as interconnect, there is also
additional CPU overhead because the client cannot receive data without copying it
from the network buffers. In the write case, the client CAN send data without the
additional data copy. This means that the client is more likely to become CPU-bound
during reads than writes.
Part V provides information about troubleshooting a Lustre file system. You will find
information in this section about:
Lustre Troubleshooting
Troubleshooting Recovery
Lustre Debugging
CHAPTER 26
Lustre Troubleshooting
This chapter provides information about troubleshooting Lustre, submitting a Lustre
bug, and Lustre performance tips. It includes the following sections:
■ Lustre Error Messages
■ Reporting a Lustre Bug
■ Common Lustre Problems
26.1 Lustre Error Messages
Several resources are available to help troubleshoot Lustre. This section describes
error numbers, error messages and logs.
The error message begins with "LustreError" in the console log and provides a short
description of:
■ What the problem is
■ Which process ID had trouble
■ Which server node it was communicating with, and so on.
Collect the first group of messages related to a problem, and any messages that
precede "LBUG" or "assertion failure" errors. Messages that mention server nodes
(OST or MDS) are specific to that server; you must collect similar messages from the
relevant server console logs.
Another Lustre debug log holds information about Lustre activity for a short period of
time, which in turn depends on the processes on the node that use Lustre. To extract
the debug logs on each of the nodes, run:
$ lctl dk <filename>
Note – LBUG freezes the thread to allow capture of the panic stack. A system reboot
is needed to clear the thread.
You can also post a question to the lustre-discuss mailing list or search the
lustre-discuss Archives for information about your issue.
You can run this tool to capture diagnostics output to include in the reported bug. To
run this tool, enter one of these commands:
# lustre-diagnostics.
Output is sent directly to the terminal. Use normal file redirection to send the output
to a file, and then manually attach the file to the bug you are submitting.
If the reported error is -2 (-ENOENT, or "No such file or directory"), then the object is
missing. This can occur either because the MDS and OST are out of sync, or because
an OST object was corrupted and deleted.
If you have recovered the file system from a disk failure by using e2fsck, then
unrecoverable objects may have been deleted or moved to /lost+found on the raw
OST partition. Because files on the MDS still reference these objects, attempts to
access them produce this error.
If you have recovered a backup of the raw MDS or OST partition, then the restored
partition is very likely to be out of sync with the rest of your cluster. No matter
which server partition you restored from backup, files on the MDS may reference
objects which no longer exist (or did not exist when the backup was taken); accessing
those files produces this error.
If the reported error is anything else (such as -5, "I/O error"), it likely indicates a
storage failure. The low-level file system returns this error if it is unable to read from
the storage device.
Suggested Action
If the reported error is -2, you can consider checking in /lost+found on your raw
OST device, to see if the missing object is there. However, it is likely that this object
is lost forever, and that the file that references the object is now partially or
completely lost. Restore this file from backup, or salvage what you can and delete it.
If the reported error is anything else, then you should immediately inspect this
server for storage problems.
To recover from this problem, you must restart Lustre services using these file
systems. There is no other way to know that the I/O made it to disk, and the state of
the cache may be inconsistent with what is on disk.
1. Generate a list of devices and determine the OST’s device number. Run:
$ lctl dl
The lctl dl command output lists the device name and number, along with the
device UUID and the number of references on the device.
3. Determine all files that are striped over the missing OST. Run:
4. If necessary, you can read the valid parts of a striped file. Run:
5. You can delete these files with the unlink or munlink command.
When you run the unlink or munlink command, the file on the MDS is
permanently removed.
6. If you need to know, specifically, which parts of the file are missing data, then
you first need to determine the file layout (striping pattern), which includes the
index of the missing OST. Run:
7. Use this computation to determine which offsets in the file are affected: [(C*N
+ X)*S, (C*N + X)*S + S - 1], N = { 0, 1, 2, ...}
where:
C = stripe count
S = stripe size
X = index of bad OST for this file
For example, for a 2 stripe file, stripe size = 1M, the bad OST is at index 0, and you
have holes in the file at: [(2*N + 0)*1M, (2*N + 0)*1M + 1M - 1], N = { 0, 1, 2, ...}
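The computation can be worked through in Python. This sketch applies the formula above for successive values of N; the function name hole_ranges and the 5 MB file size are hypothetical, 1M is taken to mean 1048576 bytes, and truncating the last range at the end of the file is an added assumption.

```python
def hole_ranges(stripe_count, stripe_size, bad_index, file_size):
    """Yield inclusive (start, end) byte ranges that map to the bad OST."""
    n = 0
    while True:
        # [(C*N + X)*S, (C*N + X)*S + S - 1]
        start = (stripe_count * n + bad_index) * stripe_size
        if start >= file_size:
            break
        end = min(start + stripe_size - 1, file_size - 1)
        yield start, end
        n += 1


MB = 1024 * 1024
# The example above: 2 stripes, 1 MB stripe size, bad OST at index 0.
holes = list(hole_ranges(stripe_count=2, stripe_size=MB, bad_index=0,
                         file_size=5 * MB))
for start, end in holes:
    print(start, end)
```

For this example, every other 1 MB region of the file, starting at offset 0, is affected.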
If the file system cannot be mounted, there is currently no way to parse metadata
directly from an MDS. If the bad OST does not start, options to mount the file system
are to provide a loop device OST in its place or replace it with a newly-formatted
OST. In that case, the missing objects are created and are read as zero-filled.
During normal operation, the MDT keeps some pre-created (but unallocated) objects
on the OST, and the relationship between LAST_ID and lov_objid should be
LAST_ID <= lov_objid. Any difference in the file values results in objects being
created on the OST when it next connects to the MDS. These objects are never
actually allocated to a file, since they are of 0 length (empty), but they do no harm.
Creating empty objects enables the OST to catch up to the MDS, so normal operations
resume.
However, in the case where lov_objid < LAST_ID, bad things can happen as the MDS
is not aware of objects that have already been allocated on the OST, and it reallocates
them to new files, overwriting their existing contents.
Although the lov_objid value should be equal to the last_used_object value, the
above rule suffices to keep Lustre happy at the expense of a few leaked objects.
In situations where there is on-disk corruption of the OST, for example caused by
running with write cache enabled on the disks, the LAST_ID value may become
inconsistent and result in a message similar to:
To recover from this situation, determine and set a reasonable LAST_ID value.
Note – The file system must be stopped on all servers before performing this
procedure.
1. The contents of the LAST_ID file must be accurate regarding the actual objects that exist on the OST.
Use GDB:
(gdb) p /x 15028
$2 = 0x3ab4
Or bc:
1. Determine a reasonable value for the LAST_ID file. Check on the MDS:
There is one entry for each OST, in OST index order. This is what the MDS thinks is
the last in-use object.
This shows you the OST state. There may be some pre-created orphans. Check for
zero-length objects. Any zero-length objects with IDs higher than LAST_ID should be
deleted. New objects will be pre-created.
If the OST LAST_ID value matches that for the objects existing on the OST, then it is
possible the lov_objid file on the MDS is incorrect. Delete the lov_objid file on the
MDS and it will be re-created from the LAST_ID on the OSTs.
If you determine the LAST_ID file on the OST is incorrect (that is, it does not match
what objects exist, does not match the MDS lov_objid value), then you have decided
on a proper value for LAST_ID.
1. Access:
cp /mnt/ost/O/0/LAST_ID /tmp/LAST_ID
5. Fix:
vi /tmp/LAST_ID.asc
6. Convert to binary:
7. Verify:
8. Replace:
cp /tmp/LAST_ID.new /mnt/ost/O/0/LAST_ID
9. Clean up:
umount /mnt/ost
Unfortunately, you cannot set sunrpc to avoid port 988. If you receive this error, do
the following:
■ Start Lustre before starting any service that uses sunrpc.
■ Use a port other than 988 for Lustre. This is configured in /etc/modprobe.conf
as an option to the LNET module. For example:
Note – You can also use the sysctl command to prevent the NFS client from
grabbing the Lustre service port. However, this is a partial workaround as other
user-space RPC servers still have the ability to grab the port.
lfs df -h
lfs df -i
Typically, Lustre reports this error to your application. If the application is checking
the return code from its function calls, then it decodes it into a textual error message
such as No space left on device. Both versions of the error message also
appear in the system log.
For more information about the lfs df command, see Section 18.5.1, “Checking File
System Free Space” on page 18-9.
Although it is less efficient, you can also use the grep command to determine which
OST or MDS is running out of space. To check the free space and inodes on a client,
enter:
Note – You can find other numeric error codes along with a short name and text
description in /usr/include/asm/errno.h.
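As an alternative to reading the header file, the same lookup can be done from Python's standard library. This sketch assumes a Linux node, where the numeric values match those in errno.h; the function name describe_error is hypothetical.

```python
import errno
import os


def describe_error(code):
    """Map a negative error number, as shown in Lustre messages, to its name and text."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


# Error numbers of the kind discussed in this chapter.
for code in (-2, -5, -28, -30):
    name, text = describe_error(code)
    print(code, name, text)
```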
For example, if a RAID rebuild is really slowing down I/O on an OST, it might
trigger watchdog timers to trip. But another message follows shortly thereafter,
indicating that the thread in question has completed processing (after some number
of seconds). Generally, this indicates a transient problem. In other cases, it may
legitimately signal that a thread is stuck because of a software error (lock inversion,
for example).
Lustre: 0:0:(watchdog.c:122:lcw_cb())
Lustre: 0:0:(linux-debug.c:132:portals_debug_dumpstack())
Call trace:
filter_do_bio+0x3dd/0xb90 [obdfilter]
default_wake_function+0x0/0x20
filter_direct_io+0x2fb/0x990 [obdfilter]
filter_preprw_read+0x5c5/0xe00 [obdfilter]
lustre_swab_niobuf_remote+0x0/0x30 [ptlrpc]
ost_brw_read+0x18df/0x2400 [ost]
ost_handle+0x14c2/0x42d0 [ost]
ptlrpc_server_handle_request+0x870/0x10b0 [ptlrpc]
ptlrpc_main+0x42e/0x7c0 [ptlrpc]
If you know the exact reason for the error, then it is safe to proceed with no further
action. If you do not know the reason, then this is a serious issue and you should
explore it with your disk vendor.
If the error occurs during failover, examine your disk cache settings. If it occurs after
a restart without failover, try to determine how the disk can report that a write
succeeded, then lose the Data Device corruption or Disk Errors.
After the file system has been running for some time, it contains more data in cache
and hence, the variability caused by reading critical metadata from disk is mostly
eliminated. The file system now reads data from the cache.
For information on determining the MDS memory and OSS memory requirements,
see Section 5.4, “Determining Memory Requirements” on page 5-11.
If you suspect bad I/O performance and an analysis of Lustre statistics indicates that
I/O is not 1 MB, check /sys/block/<device>/queue/max_sectors_kb. If the
max_sectors_kb value is less than 1024, set it to at least 1024 to improve
performance. If changing max_sectors_kb does not change the I/O size as
reported by Lustre, you may want to examine the SCSI driver code.
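The check described above can be scripted. The following Python sketch reads the sysfs value and reports whether it is below 1024; the function names and the device name passed to check_device are hypothetical, and the sketch only inspects the setting without changing it.

```python
from pathlib import Path

RECOMMENDED_MIN_KB = 1024  # threshold taken from the text above


def needs_larger_max_sectors(raw_value, minimum_kb=RECOMMENDED_MIN_KB):
    """Return True if a max_sectors_kb value is below the suggested minimum."""
    return int(raw_value.strip()) < minimum_kb


def check_device(device):
    """Read the setting for one block device (hypothetical name, e.g. 'sdb')."""
    path = Path("/sys/block") / device / "queue" / "max_sectors_kb"
    return needs_larger_max_sectors(path.read_text())
```

For example, check_device("sdb") returns True when the value for that device should be raised.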
Troubleshooting Recovery
Note – For a description of how recovery is implemented in Lustre, see Chapter 30:
Troubleshooting Recovery.
27.1 Recovering from Errors or Corruption on a Backing File System
When an OSS, MDS, or MGS server crash occurs, it is not necessary to run e2fsck on
the file system. ldiskfs journaling ensures that the file system remains coherent. The
backing file systems are never accessed directly from the client, so client crashes are
not relevant.
The only time it is REQUIRED that e2fsck be run on a device is when an event causes
problems that ldiskfs journaling is unable to handle, such as a hardware device
failure or I/O error. If the ldiskfs kernel code detects corruption on the disk, it
mounts the file system as read-only to prevent further corruption, but still allows
read access to the device. This appears as error "-30" (EROFS) in the syslogs on the
server, e.g.:
In such a situation, it is normally required that e2fsck only be run on the bad device
before placing the device back into service.
In the vast majority of cases, Lustre can cope with any inconsistencies it finds on the
disk and between other devices in the file system.
For problem analysis, it is strongly recommended that e2fsck be run under a logger,
like script, to record all of the output and changes that are made to the file system in
case this information is needed later.
If time permits, it is also a good idea to first run e2fsck in non-fixing mode (-n
option) to assess the type and extent of damage to the file system. The drawback is
that in this mode, e2fsck does not recover the file system journal, so there may
appear to be file system corruption when none really exists.
To address concerns about whether corruption is real or only due to the journal not
being replayed, you can briefly mount and unmount the ldiskfs filesystem directly on
the node with Lustre stopped (NOT via Lustre), using a command similar to:
In addition, the e2fsprogs package contains the lfsck tool, which does distributed
coherency checking for the Lustre file system after e2fsck has been run. Running
lfsck is NOT required in a large majority of cases, at a small risk of having some
leaked space in the file system. To avoid a lengthy downtime, it can be run (with
care) after Lustre is started.
2. Run e2fsck -f on the individual MDS / OST that had problems to fix any
local file system damage.
We recommend running e2fsck under script, to create a log of changes made to the
file system in case it is needed later. After e2fsck is run, bring up the file system, if
necessary, to reduce the outage window.
3. Run a full e2fsck of the MDS to create a database for lfsck. You must use the -n
option for a mounted file system, otherwise you will corrupt the file system.
4. Make this file accessible on all OSTs, either by using a shared file system or
copying the file to the OSTs. The pdcp command is useful here.
The pdcp command (installed with pdsh) can be used to copy files to groups of
hosts. pdcp is available here:
http://sourceforge.net/projects/pdsh
5. Run a similar e2fsck step on the OSTs. The e2fsck --ostdb command can be
run in parallel on all OSTs.
Note – If the OSTs do not have shared file system access to the MDS, a stub mdsdb
file, {mdsdb}.mdshdr, is generated. This can be used instead of the full mdsdb file.
script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/{ost1db} /tmp/{ost2db}
... /lustre/mount/point
Example:
script /root/lfsck.lustre.log
lfsck -n -v --mdsdb /home/mdsdb --ostdb /home/{ost1db} \
/mnt/lustre/client/
MDSDB: /home/mdsdb
OSTDB[0]: /home/ostdb
MOUNTPOINT: /mnt/lustre/client/
MDS: max_id 288 OST: max_id 321
lfsck: ost_idx 0: pass1: check for duplicate objects
lfsck: ost_idx 0: pass1 OK (287 files total)
lfsck: ost_idx 0: pass2: check for missing inode objects
lfsck: ost_idx 0: pass2 OK (287 objects)
lfsck: ost_idx 0: pass3: check for orphan objects
[0] uuid lustre-OST0000_UUID
[0] last_id 288
[0] zero-length orphan objid 1
lfsck: ost_idx 0: pass3 OK (321 files total)
lfsck: pass4: check for duplicate object references
lfsck: pass4 OK (no duplicates)
lfsck: fixed 0 errors
By default, lfsck reports errors, but it does not repair any inconsistencies found.
lfsck checks for three kinds of inconsistencies:
■ Inode exists but has missing objects (dangling inode). This normally happens if
there was a problem with an OST.
■ Inode is missing but OST has unreferenced objects (orphan object). Normally,
this happens if there was a problem with the MDS.
■ Multiple inodes reference the same objects. This can happen if the MDS is
corrupted or if the MDS storage is cached and loses some, but not all, writes.
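The three kinds of inconsistency can be illustrated with a small Python model. This is not how lfsck is implemented; it is a sketch that assumes a single OST, represents the MDS view as a mapping from inode to object IDs, and uses hypothetical names and data throughout.

```python
from collections import defaultdict


def classify(mds_refs, ost_objects):
    """mds_refs maps inode -> list of object IDs; ost_objects is a set of object IDs."""
    owners = defaultdict(list)
    for inode, objects in mds_refs.items():
        for obj in objects:
            owners[obj].append(inode)

    # Dangling inode: references an object that does not exist on the OST.
    dangling = sorted(inode for inode, objects in mds_refs.items()
                      if any(obj not in ost_objects for obj in objects))
    # Orphan object: exists on the OST but no inode references it.
    orphans = sorted(obj for obj in ost_objects if obj not in owners)
    # Duplicate reference: more than one inode claims the same object.
    duplicates = {obj: sorted(inodes) for obj, inodes in owners.items()
                  if len(inodes) > 1}
    return dangling, orphans, duplicates


# Hypothetical data: inode 11 references missing object 103, object 104 is
# unreferenced, and object 102 is claimed by both inode 10 and inode 12.
mds_refs = {10: [101, 102], 11: [103], 12: [102]}
ost_objects = {101, 102, 104}
dangling, orphans, duplicates = classify(mds_refs, ost_objects)
```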
If the file system is in use and being modified while the --mdsdb and --ostdb
steps are running, lfsck may report inconsistencies where none exist due to files
and objects being created/removed after the database files were collected.
Examine the lfsck results closely. You may want to re-run the test.
To fix dangling inodes, use lfsck with the -c option to create new, zero-length
objects on the OSTs. These files read back with binary zeros for stripes that had
objects re-created. Even without lfsck repair, these files can be read by entering:
Because it is rarely useful to have files with large holes in them, most users delete
these files after reading them (if useful) and/or restoring them from backup.
Note – You cannot write to the holes of such files without having lfsck re-create
the objects. Generally, it is easier to delete these files and restore them from backup.
To fix inodes with duplicate objects, use lfsck with the -c option to copy the
duplicate object to a new object and assign it to a file. One file will be okay and the
duplicate will likely contain garbage. By itself, lfsck cannot tell which file is the
usable one.
During recovery, clients reconnect and replay their requests serially, in the same
order they were done originally. Until a client receives a confirmation that a given
transaction has been written to stable storage, the client holds on to the transaction,
in case it needs to be replayed. Periodically, a progress message prints to the log,
stating how_many/expected clients have reconnected. If the recovery is aborted, this
log shows how many clients managed to reconnect. When all clients have completed
recovery, or if the recovery timeout is reached, the recovery period ends and the OST
resumes normal request processing.
If some clients fail to replay their requests during the recovery period, this will not
stop the recovery from completing. You may have a situation where the OST
recovers, but some clients are not able to participate in recovery (e.g. network
problems or client failure), so they are evicted and their requests are not replayed.
This would result in any operations on the evicted clients failing, including
in-progress writes, which would cause cached writes to be lost. This is a normal
outcome; the recovery cannot wait indefinitely, or the file system would be hung any
time a client failed. The lost transactions are an unfortunate result of the recovery
process.
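The client side of this behavior can be pictured with a toy Python model. It is a sketch of the idea only, not the Lustre protocol: the class ReplayClient, the transaction numbers, and the last_committed value are hypothetical names introduced for illustration.

```python
class ReplayClient:
    """Toy model of a client that keeps uncommitted transactions for replay."""

    def __init__(self):
        self.pending = []  # (transaction number, request), oldest first

    def record(self, transno, request):
        # Keep the request until the server confirms it is on stable storage.
        self.pending.append((transno, request))

    def on_commit(self, last_committed):
        # Drop every transaction the server reports as committed.
        self.pending = [(t, r) for t, r in self.pending if t > last_committed]

    def replay(self):
        # Replay what is left serially, in the original order.
        return [r for _, r in sorted(self.pending)]


client = ReplayClient()
client.record(1, "create a")
client.record(2, "write a")
client.record(3, "unlink b")
client.on_commit(1)          # server confirmed transaction 1
to_replay = client.replay()  # after a server restart: resend 2 and 3 in order
```

A client that is evicted never gets to call replay, which is how its uncommitted requests are lost in this model.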
Lustre Debugging
This chapter describes tips and information to debug Lustre, and includes the
following sections:
■ Diagnostic and Debugging Tools
■ Lustre Debugging Procedures
■ Lustre Debugging for Developers
28.1 Diagnostic and Debugging Tools
A variety of diagnostic and analysis tools are available to debug issues with the
Lustre software. Some of these are provided in Linux distributions, while others have
been developed and are made available by the Lustre project.
The following tools are also provided with the Lustre software:
■ lctl - This tool is used with the debug_kernel option to manually dump the
Lustre debugging log or post-process debugging logs that are dumped
automatically. For more information about the lctl tool, see Section 28.2.2,
“Using the lctl Tool to View Debug Messages” on page 28-7 and Section 36.3,
“lctl” on page 36-4.
■ Lustre subsystem asserts - A panic-style assertion (LBUG) in the kernel causes
Lustre to dump the debug log to the file /tmp/lustre-log.<timestamp>
where it can be retrieved after a reboot. For more information, see Section 26.1.2,
“Viewing Error Messages” on page 26-3.
■ lfs - This utility provides access to the extended attributes (EAs) of a Lustre file
(along with other information). For more information about lfs, see Section 32.1,
“lfs” on page 32-2.
The following logging and data collection tools can be used to collect information for
debugging Lustre kernel issues:
■ kdump. A Linux kernel crash utility useful for debugging a system running Red
Hat Enterprise Linux. For more information about kdump, see the Red Hat
knowledge base article How do I configure kexec/kdump on Red Hat Enterprise
Linux 5?. To download kdump, go to the Fedora Project Download site.
■ netconsole. Enables kernel-level network logging over UDP. The system request
(SysRq) key allows users to collect relevant data through netconsole.
■ netdump. A crash dump utility from Red Hat that allows memory images to be
dumped over a network to a central server for analysis. The netdump utility was
replaced by kdump in RHEL 5. For more information about netdump, see Red
Hat, Inc.'s Network Console and Crash Dump Facility.
Note – For a current list of subsystems and debug message types, see
lnet/include/libcfs/libcfs.h in the Lustre tree.
The elements of a Lustre debug message are described in Section 28.2.1.2, “Format of
Lustre Debug Messages” on page 28-6.
Types Description
Parameter           Description
subsystem           800000
debug mask          000010
smp_processor_id    0
sec.usec            1081880847.677302
stack size          1204:
pid                 2973:
Note – When lctl filters, it removes unwanted lines from the displayed output.
This does not affect the contents of the debug log in the kernel's memory. As a result,
you can print the log many times with different filtering levels without worrying
about losing data.
debug_kernel pulls the data from the kernel logs, filters it appropriately, and
displays or saves it as per the specified options.
If the debugging is being done on User Mode Linux (UML), it might be useful to
save the logs on the host machine so that they can be used at a later time.
■ Filter a log on disk, if you already have a debug log saved to disk (likely from a
crash):
The marker text defaults to the current date and time in the debug log (similar to
the example shown below):
Note – Debug messages displayed with lctl are also subject to the kernel debug
masks; the filters are additive.
bash-2.04# ./lctl
lctl > debug_kernel /tmp/lustre_logs/log_all
Debug log: 324 lines, 324 kept, 0 dropped.
lctl > filter trace
Disabling output of type "trace"
lctl > debug_kernel /tmp/lustre_logs/log_notrace
Debug log: 324 lines, 282 kept, 42 dropped.
lctl > show trace
Enabling output of type "trace"
lctl > filter portals
Disabling output from subsystem "portals"
lctl > debug_kernel /tmp/lustre_logs/log_noportals
Debug log: 324 lines, 258 kept, 66 dropped.
The debug_daemon is highly dependent on file system write speed. File system
write operations may not be fast enough to flush out all of the debug_buffer if the
Lustre file system is under heavy system load and continues to CDEBUG to the
debug_buffer. The debug_daemon will write the message DEBUG MARKER:
Trace buffer full into the debug_buffer to indicate the debug_buffer
contents are overlapping before the debug_daemon flushes data to a file.
Users can use lctl to start or stop the Lustre daemon that dumps the
debug_buffer to a file. Users can also temporarily hold the daemon from dumping
to the file. The debug_daemon sub-command of lctl provides these
functions.
The daemon wraps around and dumps data to the beginning of the file when the
output file size is over the limit of the user-specified file size. To decode the dumped
file to ASCII and order the log entries by time, run:
To completely shut down the debug_daemon operation and flush the file output,
enter:
debug_daemon stop
Otherwise, debug_daemon is shut down as part of the Lustre file system shutdown
process. Users can restart debug_daemon with the start command after each stop
command.
This is an example using debug_daemon with the interactive mode of lctl to dump
debug logs to a 10 MB file.
#~/utils/lctl
The text message *** End of debug_daemon trace log *** appears at the
end of each output file.
sysctl -w lnet.debug=0
sysctl -w lnet.debug=-1
sysctl -w lnet.debug=net
sysctl -w lnet.debug=+net
sysctl -w lnet.debug=-net
The various options available to print to kernel debug logs are listed in
lnet/include/libcfs/libcfs.h.
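The five sysctl forms listed above differ only in how the new value combines with the current debug mask: 0 clears it, -1 enables everything, a bare name replaces the mask, and a +/- prefix adds or removes one flag. The following Python sketch models that arithmetic. The bit values and the three flag names are placeholders chosen for illustration (the real definitions are in libcfs.h), and apply_debug_setting is a hypothetical helper, not part of Lustre.

```python
# Placeholder bit assignments for illustration only; the real values
# are defined in lnet/include/libcfs/libcfs.h.
FLAGS = {"trace": 0x1, "net": 0x2, "malloc": 0x4}
ALL_BITS = 0xFFFFFFFF


def apply_debug_setting(current: int, value: str) -> int:
    """Return the mask that results from 'sysctl -w lnet.debug=<value>'."""
    if value == "0":
        return 0                              # turn all debugging off
    if value == "-1":
        return ALL_BITS                       # turn full debugging on
    if value.startswith("+"):
        return current | FLAGS[value[1:]]     # add one flag, keep the rest
    if value.startswith("-"):
        return current & ~FLAGS[value[1:]]    # remove one flag, keep the rest
    return FLAGS[value]                       # replace the mask with one flag
```

For example, starting from a mask with trace and malloc set, "+net" yields all three flags, while "net" leaves only the net flag.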
Sometimes, a system call may fork child processes. In this situation, use the -f
option of strace to trace the child processes:
Use the -ff option, along with -o, to save the trace output in filename.pid, where
pid is the process ID of the process being traced. Use the -ttt option to timestamp
all lines in the strace output, so they can be correlated to operations in the Lustre
kernel debug log.
If the debugging is done in UML, save the traces on the host machine. In this
example, hostfs is mounted on /r:
$ strace -o /r/tmp/vi.strace
The lfs getstripe utility is written in C; it takes a Lustre filename as input and
lists all the objects that form a part of this file. To obtain this information for the file
/mnt/lustre/frog in Lustre file system, run:
The debugfs tool is provided in the e2fsprogs package. It can be used for interactive
debugging of an ldiskfs file system. The debugfs tool can either be used to check
status or modify information in the file system. In Lustre, all objects that belong to a
file are stored in an underlying ldiskfs file system on the OSTs. The file system uses
the object IDs as the file names. Once the object IDs are known, use the debugfs tool
to obtain the attributes of all objects from different OSTs.
A sample run for the /mnt/lustre/frog file used in the above example is shown
here:
$ debugfs -c /tmp/ost1
debugfs: cd O
debugfs: cd 0 /* for files in group 0 */
debugfs: cd d<objid % 32>
debugfs: stat <objid> /* for getattr on object */
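The directory walk in the sample run can be expressed as a single path computation. This is a minimal sketch that follows only the layout shown above (O, then the group, then d<objid % 32>, then the object ID); ost_object_path is a hypothetical helper name, and the object ID used in the example is made up.

```python
def ost_object_path(objid: int, group: int = 0) -> str:
    """Build the path of an object inside the OST's ldiskfs file system.

    Mirrors the debugfs session above: cd O, cd <group>,
    cd d<objid % 32>, then the object itself is named <objid>.
    """
    return f"O/{group}/d{objid % 32}/{objid}"


def debugfs_stat_command(objid: int, group: int = 0) -> str:
    """Return the debugfs 'stat' request for the object."""
    return f"stat {ost_object_path(objid, group)}"
```

With this helper, an illustrative object ID of 34 in group 0 maps to O/0/d2/34.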
sysctl -w lnet.printk=-1
This slows down the system dramatically. It is also possible to selectively enable or
disable this capability for particular flags using:
sysctl -w lnet.printk=+vfstrace
sysctl -w lnet.printk=-vfstrace
To use these macros, you will need to set the DEBUG_SUBSYSTEM variable at the top
of the file as shown below:
Macro                Description
ENTRY and EXIT       Add messages to aid in call tracing (take no arguments).
                     When using these macros, cover all exit conditions to
                     avoid confusion when the debug log reports that a
                     function was entered, but never exited.
LDLM_DEBUG and       Used when tracing MDS and VFS operations for locking.
LDLM_DEBUG_NOLOCK    These macros build a thin trace that shows the protocol
                     exchanges between nodes.
DEBUG_REQ            Prints information about the given ptlrpc_request
                     structure.
OBD_FAIL_CHECK       Allows insertion of failure points into the Lustre code.
                     This is useful to generate regression tests that can hit
                     a very specific sequence of events. This works in
                     conjunction with "sysctl -w lustre.fail_loc={fail_loc}"
                     to set a specific failure point for which a given
                     OBD_FAIL_CHECK will test.
OBD_FAIL_TIMEOUT     Similar to OBD_FAIL_CHECK. Useful to simulate hung,
                     blocked or busy processes or network devices. If the
                     given fail_loc is hit, OBD_FAIL_TIMEOUT waits for the
                     specified number of seconds.
OBD_RACE             Similar to OBD_FAIL_CHECK. Useful to have multiple
                     processes execute the same code concurrently to provoke
                     locking races. The first process to hit OBD_RACE sleeps
                     until a second process hits OBD_RACE, then both
                     processes continue.
OBD_FAIL_ONCE        A flag set on a lustre.fail_loc breakpoint to cause the
                     OBD_FAIL_CHECK condition to be hit only one time.
                     Otherwise, a fail_loc is permanent until it is cleared
                     with "sysctl -w lustre.fail_loc=0".
OBD_FAIL_RAND        Has OBD_FAIL_CHECK fail randomly; on average, one check
                     in every lustre.fail_val fails (a failure rate of
                     1 / lustre.fail_val).
OBD_FAIL_SKIP        Has OBD_FAIL_CHECK succeed lustre.fail_val times, and
                     then fail permanently or once with OBD_FAIL_ONCE.
OBD_FAIL_SOME        Has OBD_FAIL_CHECK fail lustre.fail_val times, and then
                     succeed.
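The interaction of the ONCE, SKIP and SOME modifiers described in the table can be easier to follow as a small state machine. The Python class below is a toy model written for this explanation only: FailLoc and its attributes are hypothetical names, it assumes fail_val is at least 1 when SKIP or SOME is used, and it leaves out OBD_FAIL_RAND. It does not reproduce the Lustre implementation.

```python
class FailLoc:
    """Toy model of one fail_loc breakpoint and its modifier flags."""

    def __init__(self, fail_val=0, once=False, skip=False, some=False):
        self.fail_val = fail_val
        self.once = once
        self.skip = skip
        self.some = some
        self.count = 0      # checks reached while the breakpoint was armed
        self.armed = True   # cleared when the breakpoint stops firing

    def check(self) -> bool:
        """Return True when the simulated OBD_FAIL_CHECK injects a failure."""
        if not self.armed:
            return False
        self.count += 1
        if self.skip and self.count <= self.fail_val:
            return False        # SKIP: the first fail_val checks succeed
        if self.some and self.count >= self.fail_val:
            self.armed = False  # SOME: stop after fail_val failures
        if self.once:
            self.armed = False  # ONCE: fire a single time
        return True
```

Calling check() four times on FailLoc(fail_val=2, skip=True, once=True) gives two passes, one injected failure, and then passes again, which matches the "succeed fail_val times, then fail once" wording in the table.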
Ptlrpc is an RPC protocol layered on LNET that deals with stateful servers and has
semantics and built-in support for recovery.
2. When a request buffer becomes idle, it is added to the service's request buffer
history list.
3. Buffers are culled from the service's request buffer history if it has grown above
req_buffer_history_max, and their requests are removed from the service's request
history.
Request history is accessed and controlled using the following /proc files under the
service directory:
■ req_buffer_history_len
Number of request buffers currently in the history
■ req_buffer_history_max
Maximum number of request buffers to keep
■ req_history
The request history
Requests in the history include "live" requests that are currently being handled. Each
line in req_history looks like:
Parameter Description
Before running this program, you must turn on debugging to collect all malloc and
free entries. Run:
sysctl -w lnet.debug=+malloc
1. Dump the log into a user-specified log file using lctl (see Section 28.2.2,
“Using the lctl Tool to View Debug Messages” on page 28-7).
The tool displays the following output to show the leaks found:
Leak:32bytes allocated at a23a8fc
(service.c:ptlrpc_init_svc:144,debug file line 241)
Part VI includes reference information on Lustre user utilities, configuration files and
module parameters, programming interfaces, system configuration utilities, and
system limits. You will find information in this section about:
■ Lustre Recovery
■ LustreProc
■ User Utilities
If you need to build a customized Lustre server kernel or are using a Linux kernel
that has not been tested with the version of Lustre you are installing, you may need
to build and install Lustre from source code. This chapter describes:
■ Overview and Prerequisites
■ Patching the Kernel
■ Creating and Installing the Lustre Packages
■ Installing Lustre with a Third-Party Network Stack
29.1 Overview and Prerequisites
Lustre can be installed from either pre-built binary packages (RPMs) or
freely-available source code. Installing from the package release is recommended
unless you need to customize the Lustre server kernel or will be using a Linux
kernel that has not been tested with Lustre. For a list of supported Linux
distributions and architectures, see the topic Lustre_2.0 on the Lustre wiki. The
procedure for installing Lustre from RPMs is described in Chapter 8: Installing the
Lustre Software.
Note – When using third-party network hardware with Lustre, the third-party
modules (typically, the drivers) must be linked against the Linux kernel. The LNET
modules in Lustre also need these references. To meet these requirements, a specific
process must be followed to install and recompile Lustre. See Section 29.4, “Installing
Lustre with a Third-Party Network Stack” on page 29-9, for an example showing
how to install Lustre 1.6.6 using the Myricom MX 1.2.7 driver. The same process can
be used for other third-party network stacks.
Quilt manages a stack of patches on a single source tree. A series file lists the patch
files and the order in which they are applied. Patches are applied, incrementally, on
the base tree and all preceding patches. You can:
■ Apply patches from the stack (quilt push)
■ Remove patches from the stack (quilt pop)
■ Query the contents of the series file (quilt series), the contents of the stack
(quilt applied, quilt previous, quilt top), and the patches that are not
applied at a particular moment (quilt next, quilt unapplied).
■ Edit and refresh (update) patches with Quilt, as well as revert inadvertent
changes, and fork or clone patches and show the diffs before and after work.
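The stack model behind the Quilt commands listed above can be pictured as a series list plus a pointer to the topmost applied patch. The Python sketch below illustrates that model only; PatchStack and the patch names in the example are hypothetical, and no files are patched.

```python
class PatchStack:
    """Toy model of a Quilt series file and its applied-patch stack."""

    def __init__(self, series):
        self.series = list(series)  # patch order taken from the series file
        self.depth = 0              # number of patches currently applied

    def push(self, all_patches=False):
        """Apply the next patch, or all remaining ones (like 'quilt push -a')."""
        if self.depth == len(self.series):
            raise RuntimeError("series fully applied")
        self.depth = len(self.series) if all_patches else self.depth + 1
        return self.top()

    def pop(self):
        """Remove the topmost applied patch (like 'quilt pop')."""
        if self.depth == 0:
            raise RuntimeError("no patch applied")
        self.depth -= 1
        return self.top()

    def applied(self):
        return self.series[:self.depth]

    def unapplied(self):
        return self.series[self.depth:]

    def top(self):
        return self.series[self.depth - 1] if self.depth else None

    def next(self):
        return self.series[self.depth] if self.depth < len(self.series) else None
```

Because each patch is applied on top of all preceding ones, the model only ever moves the pointer one way along the series order; it never applies a later patch while skipping an earlier one.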
A variety of Quilt packages (RPMs, SRPMs and tarballs) are available from various
sources. Use the most recent version you can find. Quilt depends on several other
utilities, e.g., the coreutils RPM that is only available in RedHat 9. For other
RedHat kernels, you have to get the required packages to successfully install Quilt. If
you cannot locate a Quilt package or fulfill its dependencies, you can build Quilt
from a tarball, available at the Quilt project website:
http://savannah.nongnu.org/projects/quilt
For additional information on using Quilt, including its commands, see Introduction
to Quilt and the quilt(1) man page.
Caution – Lustre contains kernel modifications which interact with storage devices
and may introduce security issues and data loss if not installed, configured and
administered correctly. Before installing Lustre, be cautious and back up ALL data.
Note – Each patch series has been tailored to a specific kernel version, and may or
may not apply cleanly to other versions of the kernel.
1. Verify that all of the Lustre installation requirements have been met.
For more information on these prerequisites, see:
■ Hardware requirements in Chapter 5: Setting Up a Lustre File System
■ Software and environmental requirements in Section 8.1, “Preparing to Install
the Lustre Software” on page 8-2
2. Select a config file for your kernel, located in the kernel_configs directory
(lustre/kernel_patches/kernel_configs).
The kernel_configs directory contains the .config files, which are named to
indicate the kernel and architecture with which they are associated. For example,
the configuration file for the 2.6.18 kernel shipped with RHEL 5 (suitable for i686
SMP systems) is kernel-2.6.18-2.6-rhel5-i686-smp.config.
3. Select the series file for your kernel, located in the series directory
(lustre/kernel_patches/series).
The series file contains the patches that need to be applied to the kernel.
4. Set up the necessary symlinks between the kernel patches and the Lustre
source.
This example assumes that the Lustre source files are unpacked under
/tmp/lustre-1.6.5.1 and that you have chosen the 2.6-rhel5.series file. Run:
$ cd /tmp/kernels/linux-2.6.18
$ rm -f patches series
$ ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/series/2.6-\
rhel5.series ./series
$ ln -s /tmp/lustre-1.6.5.1/lustre/kernel_patches/patches .
5. Use Quilt to apply the patches in the selected series file to the unpatched
kernel. Run:
$ cd /tmp/kernels/linux-2.6.18
$ quilt push -av
The patched destination tree acts as a base Linux source tree for Lustre.
2. Run the Lustre configure script against the patched kernel and create the Lustre
packages.
Note – You do not need to run the Lustre configure script against an unpatched
kernel.
lustre-1.6.5.1-\
2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
lustre-debuginfo-1.6.5.1-\
2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
lustre-modules-1.6.5.1-\
2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
lustre-source-1.6.5.1-\
2.6.18_53.xx.xx.el5_lustre.1.6.5.1.custom_20081021.i686.rpm
Note – Several features and packages are available that extend the core functionality
of Lustre. These features/packages can be enabled at build time by passing the
appropriate arguments to the configure command. For a list of these features and
packages, run ./configure --help in the Lustre source tree. The configs/
directory of the kernel source contains the config files matching each kernel
version. Copy one to .config at the root of the kernel tree.
3. Create the kernel package. Navigate to the kernel source directory and run:
$ make rpm
Example result:
kernel-2.6.95.0.3.EL_lustre.1.6.5.1custom-1.i686.rpm
Note – Step 3 is only valid for RedHat and SuSE kernels. If you are using a stock
Linux kernel, you need to get a script to create the kernel RPM.
Note – Running the patched server kernel on the clients is optional. It is not
necessary unless the clients will be used for multiple purposes, for example, to run as
a client and an OST.
$ rpm -i e2fsprogs-<ver>
5. Verify that the boot loader (grub.conf or lilo.conf) has been updated to load the
patched kernel.
Once all the machines have rebooted, the next steps are to configure Lustre
Networking (LNET) and the Lustre file system. See Chapter 10: Configuring Lustre.
$ rpm -ivh \
kernel-lustre-source-2.6.18-92.1.10.el5_lustre.1.6.6.x86_64.rpm
$ cd /usr/src/linux-2.6.18-92.1.10.el5_lustre.1.6.6
$ make distclean
$ make oldconfig dep bzImage modules
$ cp /boot/config-`uname -r` .config
$ make oldconfig || make menuconfig
$ make include/asm
$ make include/linux/version.h
$ make SUBDIRS=scripts
$ make rpm
$ rpm -ivh \
~/rpmbuild/kernel-lustre-2.6.18-92.1.10.el5_\
lustre.1.6.6.x86_64.rpm
$ mkinitrd /boot/2.6.18-92.1.10.el5_lustre.1.6.6
c. Update the boot loader (/etc/grub.conf) with the new kernel boot
information.
$ /sbin/shutdown 0 -r
$ cd /usr/src/
$ gunzip mx_1.2.7.tar.gz    # can be obtained from www.myri.com/scs/
$ tar -xvf mx_1.2.7.tar
$ cd mx-1.2.7
$ ln -s common include
$ ./configure --with-kernel-lib
$ make
$ make install
a. Install the Lustre source (this can be done via RPM or tarball). The source file
is available at the Lustre download site. This example shows installation via
the tarball.
$ cd /usr/src/
$ gunzip lustre-1.6.6.tar.gz
$ tar -xvf lustre-1.6.6.tar
$ cd lustre-1.6.6
$ ./configure --with-linux=/usr/src/linux \
--with-mx=/usr/src/mx-1.2.7
$ make
$ make rpms
The make rpms command output shows the location of the generated RPMs.
$ rpm -ivh \
lustre-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6smp.x86_64.rpm
$ rpm -ivh \
lustre-modules-1.6.6-2.6.18_92.1.10.el5_lustre.1.6.6\
smp.x86_64.rpm
$ rpm -ivh \
lustre-ldiskfs-3.0.6-2.6.18_92.1.10.el5_lustre.1.6.6\
smp.x86_64.rpm
vim /etc/sysconfig/network-scripts/myri0
8. Start Lustre.
Once all the machines have rebooted, the next steps are to configure Lustre
Networking (LNET) and the Lustre file system. See Chapter 10: Configuring Lustre.
Lustre Recovery
This chapter describes how recovery is implemented in Lustre and includes the
following sections:
■ Recovery Overview
■ Metadata Replay
■ Reply Reconstruction
■ Version-based Recovery
■ Commit on Share
Note – Usually the Lustre recovery process is transparent. For information about
troubleshooting recovery when something goes wrong, see Chapter 27:
Troubleshooting Recovery.
30.1 Recovery Overview
Lustre's recovery feature is responsible for dealing with node or network failure and
returning the cluster to a consistent, performant state. Because Lustre allows servers
to perform asynchronous update operations to the on-disk file system (i.e., the server
can reply without waiting for the update to synchronously commit to disk), the
clients may have state in memory that is newer than what the server can recover
from disk after a crash.
Currently, all Lustre failure and recovery operations are based on the concept of
connection failure; all imports or exports associated with a given connection are
considered to fail if any of them fail.
The reconnection to a new (or restarted) MDS is managed by the file system
configuration loaded by the client when the file system is first mounted. If a failover
MDS has been configured (using the --failnode= option to mkfs.lustre or
tunefs.lustre), the client tries to reconnect to both the primary and backup MDS
until one of them responds that the failed MDT is again available. At that point, the
client begins recovery. For more information, see Section 30.2, “Metadata Replay” on
page 30-6.
The MDS (via the LOV) detects that an OST is unavailable and skips it when
assigning objects to new files. When the OST is restarted or re-establishes
communication with the MDS, the MDS and OST automatically perform orphan
recovery to destroy any objects that belong to files that were deleted while the OST
was unavailable. For more information, see Section 27.2.1, “Working with Orphaned
Objects” on page 27-8.
While the OSC to OST operation recovery protocol is the same as that between the
MDC and MDT using the Metadata Replay protocol, typically the OST commits bulk
write operations to disk synchronously and each reply indicates that the request is
already committed and the data does not need to be saved for recovery. In some
cases, the OST replies to the client before the operation is committed to disk (e.g.
truncate, destroy, setattr, and I/O operations in very new versions of Lustre), and
normal replay and resend handling is done, including resending of the bulk writes.
In this case, the client keeps a copy of the data available in memory until the server
indicates that the write has committed to disk.
To force an OST recovery, unmount the OST and then mount it again. If the OST was
connected to clients before it failed, then a recovery process starts after the remount,
enabling clients to reconnect to the OST and replay transactions in their queue. When
the OST is in recovery mode, all new client connections are refused until the recovery
finishes. The recovery is complete when either all previously-connected clients
reconnect and their transactions are replayed or a client connection attempt times
out. If a connection attempt times out, then all clients waiting to reconnect (and their
transactions) are lost.
To determine an OST’s device number and device name, run the lctl dl command.
Sample lctl dl command output is shown below:
If a request was processed by the server, but the reply was dropped (i.e., did not
arrive back at the client), the server must reconstruct the reply when the client
resends the request, rather than performing the same request twice.
Metadata replay ensures that the failover MDS re-accumulates state resulting from
transactions whose effects were made visible to clients, but which were not
committed to the disk.
Each reply sent to a client (regardless of request type) also contains the last
committed transaction number that indicates the highest transaction number
committed to the file system. The ldiskfs backing file system that Lustre uses
enforces the requirement that any earlier disk operation will always be committed to
disk before a later disk operation, so the last committed transaction number also
reports that any requests with a lower transaction number have been committed to
disk.
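Because commits are ordered, a client can use the last committed transaction number in any reply to discard every saved request at or below that number. The following Python sketch shows this bookkeeping under that assumption; ReplayList and its method are hypothetical names used for illustration, not Lustre structures.

```python
class ReplayList:
    """Requests a client keeps until the server reports them committed."""

    def __init__(self):
        self.pending = {}  # transaction number -> saved request

    def on_reply(self, transno, request, last_committed):
        """Record a replied request, then drop everything already on disk.

        Returns the transaction numbers still held for possible replay.
        """
        self.pending[transno] = request
        # Commits are ordered, so every transaction at or below
        # last_committed is known to be on stable storage.
        for t in [t for t in self.pending if t <= last_committed]:
            del self.pending[t]
        return sorted(self.pending)
```

In this model a single reply can release many saved requests at once, which is why the number is carried on every reply regardless of request type.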
Replay operations are those for which the client received a reply from the server that
the operation had been successfully completed. These operations need to be redone
in exactly the same manner after a server restart as had been reported before the
server failed. Replay can only happen if the server failed; otherwise it will not have
lost any state in memory.
Resend operations are those for which the client never received a reply, so their final
state is unknown to the client. The client sends unanswered requests to the server
again in XID order, and again awaits a reply for each one. In some cases, resent
requests have been handled and committed to disk by the server (possibly also
having dependent operations committed), in which case, the server performs reply
reconstruction for the lost reply. In other cases, the server did not receive the lost
request at all and processing proceeds as with any normal request. Both outcomes
are possible after a network interruption. It is also possible that the server
received the request, but was unable to reply or commit it to disk before failure.
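The distinction between replay and resend comes down to whether a reply was received and whether the transaction is already committed. A minimal Python sketch of how a client could sort its outstanding requests after a server failure is shown below. The Request type and plan_recovery are hypothetical; sorting the replay list by transaction number is an assumption made here to match the order in which the server processes replays.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    xid: int
    transno: Optional[int] = None  # set only once a reply has arrived


def plan_recovery(requests, last_committed):
    """Split outstanding requests into a replay list and a resend list.

    Replay: replied to, but not yet committed on the server.
    Resend: never answered, so the final state is unknown.
    Replied and committed requests need no further action.
    """
    replay = sorted(
        (r for r in requests
         if r.transno is not None and r.transno > last_committed),
        key=lambda r: r.transno,
    )
    resend = sorted((r for r in requests if r.transno is None),
                    key=lambda r: r.xid)
    return [r.xid for r in replay], [r.xid for r in resend]
```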
In the absence of any client connection attempts, the server waits indefinitely for the
clients to reconnect. This is intended to handle the case where the server has a
network problem and clients are unable to reconnect and/or if the server needs to be
restarted repeatedly to resolve some problem with hardware or software. Once the
server detects client connection attempts - either new clients or previously-connected
clients - a recovery timer starts and forces recovery to finish in a finite time regardless
of whether the previously-connected clients are available or not.
If no client entries are present in the last_rcvd file, or if the administrator manually
aborts recovery, the server does not wait for client reconnection and proceeds to
allow all clients to connect.
As clients connect, the server gathers information from each one to determine how
long the recovery needs to take. Each client reports its connection UUID, and the
server does a lookup for this UUID in the last_rcvd file to determine if this client was
previously connected. If not, the client is refused connection and it will retry until
recovery is completed. Each client reports its last seen transaction, so the server
knows when all transactions have been replayed. The client also reports the amount
of time that it was previously waiting for request completion so that the server can
estimate how long some clients might need to detect the server failure and reconnect.
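The connection rules described above (no waiting when last_rcvd has no client entries, and refusal of unknown clients while recovery is in progress) can be summarized as a single decision. This Python sketch is illustrative only; admit_client and its parameters are hypothetical names, and last_rcvd is modeled as a plain set of UUID strings.

```python
def admit_client(client_uuid, last_rcvd_uuids, in_recovery):
    """Decide whether a connecting client is accepted right now.

    last_rcvd_uuids models the client entries in the last_rcvd file.
    A refused client is expected to retry until recovery completes.
    """
    if not in_recovery or not last_rcvd_uuids:
        return True  # nothing to wait for: all clients may connect
    return client_uuid in last_rcvd_uuids  # only previous clients during recovery
```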
If the client times out during replay, it attempts to reconnect. If the client is unable to
reconnect, REPLAY fails and it returns to DISCON state. It is possible that clients will
time out frequently during REPLAY, so reconnection should not delay an already
slow process more than necessary. We can mitigate this by increasing the timeout
during replay.
Open requests that are on the replay list may have a transaction number lower than
the server's last committed transaction number. The server processes those open
requests immediately. The server then processes replayed requests from all of the
clients in transaction number order, starting at the last committed transaction
number to ensure that the state is updated on disk in exactly the same manner as it
was before the crash. As each replayed request is processed, the last committed
transaction is incremented. If the server receives a replay request from a client that is
higher than the current last committed transaction, that request is put aside until
other clients provide the intervening transactions. In this manner, the server replays
requests in the same sequence as they were previously executed on the server until
either all clients are out of requests to replay or there is a gap in a sequence.
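The ordering rule in this paragraph (process the next expected transaction, set aside anything that arrives early, stop at a gap) can be sketched with a priority queue. The Python function below is a simplified model: replay_in_order is a hypothetical name, transaction numbers are assumed to be consecutive integers, and the first expected transaction is assumed to be last_committed + 1.

```python
import heapq


def replay_in_order(last_committed, offered):
    """Replay transactions in number order as clients offer them.

    offered lists transaction numbers in arrival order.
    Returns (processed, set_aside): the numbers replayed in sequence and
    the ones still waiting behind a gap.
    """
    waiting = []  # min-heap of requests that arrived ahead of their turn
    processed = []
    next_transno = last_committed + 1
    for transno in offered:
        heapq.heappush(waiting, transno)
        # Drain every request that is now next in sequence.
        while waiting and waiting[0] == next_transno:
            processed.append(heapq.heappop(waiting))
            next_transno += 1
    return processed, sorted(waiting)
```

If transaction 104 is never offered, everything above it stays in the set-aside list, which is the gap case discussed in the following paragraphs.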
In the case where all clients have reconnected, but there is a gap in the replay
sequence, the only possibility is that some requests were processed by the server but
the reply was lost. Since the client must still have these requests in its resend list,
they are processed after recovery is finished.
In the case where not all clients have reconnected, it is likely that the failed clients
had requests that will no longer be replayed. The VBR feature is used to determine if
a request following a transaction gap is safe to be replayed. Each item in the file
system (MDS inode or OST object) stores on disk the number of the last transaction
in which it was modified. Each reply from the server contains the previous version
number of the objects that it affects. During VBR replay, the server matches the
previous version numbers in the resend request against the current version number.
If the versions match, the request is the next one that affects the object and can be
safely replayed. For more information, see Section 30.4, “Version-based Recovery” on
page 30-13.
After all of the saved requests and locks have been replayed, the client sends an
MDS_GETSTATUS request with last-replay flag set. The reply to that request is held
back until all clients have completed replay (sent the same flagged getstatus request),
so that clients don't send non-recovery requests before recovery is complete.
For open requests, the "disposition" of the open must also be stored.
The disposition, status and request data (re-sent intact by the client) are sufficient to
determine which type of lock handle was granted, whether an open file handle was
created, and which resource should be described in the mds_body.
In pre-VBR versions of Lustre, if the MGS or an OST went down and then recovered,
a recovery process was triggered in which clients attempted to replay their requests.
Clients were only allowed to replay RPCs in serial order. If a particular client could
not replay its requests, then those requests were lost as well as the requests of clients
later in the sequence. The "downstream" clients never got to replay their requests
because of the wait on the earlier client’s RPCs. Eventually, the recovery period
would time out (so the component could accept new requests), leaving some number
of clients evicted and their requests and data lost.
With VBR, the recovery mechanism does not result in the loss of clients or their data,
because changes in inode versions are tracked, and more clients are able to
reintegrate into the cluster. With VBR, inode tracking looks like this:
■ Each inode (see footnote 2) stores a version, that is, the number of the last
transaction (transno) in which the inode was changed.
■ When an inode is about to be changed, a pre-operation version of the inode is
saved in the client’s data.
■ The client keeps the pre-operation inode version and the post-operation version
(transaction number) for replay, and sends them in the event of a server failure.
■ If the pre-operation version matches, then the request is replayed. The
post-operation version is assigned on all inodes modified in the request.
Footnote 1. There are two scenarios under which client RPCs are not replayed:
(1) Non-functioning or isolated clients do not reconnect, and they cannot replay
their RPCs, causing a gap in the replay sequence. These clients get errors and are
evicted.
(2) Functioning clients connect, but they cannot replay some or all of their RPCs
that occurred after the gap caused by the non-functioning/isolated clients. These
clients get errors (caused by the failed clients). With VBR, these requests have a
better chance to replay because the "gaps" are only related to specific files that the
missing client(s) changed.
Footnote 2. Usually, there are two inodes, a parent and a child.
1. VBR only allows clients to replay transactions if the affected inodes have the same
version as during the original execution of the transactions, even if there is a gap in
transactions due to a missed client.
2. The server attempts to execute every transaction that the client offers, even if it
encounters a re-integration failure.
3. When the replay is complete, the client and server check if a replay failed on any
transaction because of inode version mismatch. If the versions match, the client
gets a successful re-integration message. If the versions do not match, then the
client is evicted.
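The three steps above can be sketched as a loop that compares each request's pre-operation versions with the server's current versions, keeps going after a mismatch, and reports the outcome at the end. This Python function is a simplified illustration: vbr_replay, the dictionary layout, and the inode names in the example are hypothetical.

```python
def vbr_replay(versions, requests):
    """Replay one client's requests under version-based recovery.

    versions: inode name -> current version on the server (updated in place).
    requests: list of (transno, {inode name: pre-operation version}).
    Returns (replayed, failed) transaction numbers; in this model the
    client would be evicted if 'failed' is not empty.
    """
    replayed, failed = [], []
    for transno, pre_versions in sorted(requests, key=lambda r: r[0]):
        if all(versions.get(inode) == v for inode, v in pre_versions.items()):
            for inode in pre_versions:
                versions[inode] = transno  # assign the post-operation version
            replayed.append(transno)
        else:
            failed.append(transno)  # keep going: every offered request is tried
    return replayed, failed
```

For example, with server versions {"dir": 100, "a": 90, "b": 95}, a request at transaction 104 that expects "dir" at version 103 (written by a client that never reconnected) fails, while requests touching only "a" still replay across the gap.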
VBR recovery is fully transparent to users. It may lead to slightly longer recovery
times if the cluster loses several clients during server recovery.
This message indicates why the client was evicted. No action is needed.
If there was a dependency between client transactions (for example, creating and
deleting the same file), and one or more clients did not reconnect in time, then some
clients may have been evicted because their transactions depended on transactions
from the missing clients. Evictions of those clients caused more clients to be evicted
and so on, resulting in "cascading" client evictions.
To set a default value for COS (disable/enable) when the file system is created, use:
--param mdt.commit_on_sharing=0/1
Note – Enabling COS may cause the MDS to do a large number of synchronous disk
operations, hurting performance. Placing the ldiskfs journal on a low-latency external
device may improve file system performance.
LustreProc
The /proc file system acts as an interface to internal data structures in the kernel. The
/proc variables can be used to control aspects of Lustre performance and provide
information.
This chapter describes Lustre /proc entries and includes the following sections:
■ Proc Entries for Lustre
■ Lustre I/O Tunables
■ Debug
31.1 Proc Entries for Lustre
This section describes /proc entries for Lustre.
# cat /proc/fs/lustre/mgs/MGS/filesystems
spfs
lustre
■ The server names participating in a file system (for each file system that has at
least one server running)
# cat /proc/fs/lustre/mgs/MGS/live/spfs
fsname: spfs
flags: 0x0 gen: 7
spfs-MDT0000
spfs-OST0000
# cat /proc/fs/lustre/devices
0 UP mgs MGS MGS 11
1 UP mgc MGC192.168.10.34@tcp 1f45bb57-d9be-2ddb-c0b0-5431a4922670 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 7
5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
7 UP lov lustre-clilov-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 4
8 UP mdc lustre-MDT0000-mdc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
9 UP osc lustre-OST0000-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
10 UP osc lustre-OST0001-osc-ce63ca00 08ac6584-6c4a-3536-2c6d-b36cf9cbdaa0 5
# e2label /dev/sda
lustre-MDT0000
/proc/sys/lustre/timeout
This is the time period that a client waits for a server to complete an RPC (default is
100s). Servers wait half of this time for a normal client RPC to complete and a quarter
of this time for a single bulk request (read or write of up to 1 MB) to complete. The
client pings recoverable targets (MDS and OSTs) at one quarter of the timeout, and
the server waits one and a half times the timeout before evicting a client for being
"stale."
Note – Lustre sends periodic ‘PING’ messages to servers with which it had no
communication for a specified period of time. Any network activity on the file
system that triggers network traffic toward servers also works as a health check.
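The ratios described for /proc/sys/lustre/timeout can be worked through numerically. The following is a minimal sketch that only restates the fractions given in the text; the function name and dictionary keys are hypothetical and do not correspond to Lustre parameters.

```python
def derived_timeouts(timeout=100):
    """Return the intervals (in seconds) derived from the base timeout."""
    return {
        "client_rpc_wait": timeout,            # client waits the full timeout
        "server_normal_rpc": timeout / 2,      # half for a normal client RPC
        "server_bulk_request": timeout / 4,    # a quarter for a single bulk request
        "client_ping_interval": timeout / 4,   # clients ping at a quarter of the timeout
        "server_eviction_wait": timeout * 1.5, # one and a half times before eviction
    }


if __name__ == "__main__":
    for name, seconds in derived_timeouts().items():
        print(f"{name}: {seconds:g}s")
```

With the default of 100s, a server would wait 50s for a normal RPC and 150s before evicting a client as stale.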
/proc/sys/lustre/ldlm_timeout
This is the time period for which a server will wait for a client to reply to an initial
AST (lock cancellation request) where default is 20s for an OST and 6s for an MDS. If
the client replies to the AST, the server will give it a normal timeout (half of the client
timeout) to flush any dirty data and release the lock.
/proc/sys/lustre/dump_on_timeout
This triggers dumps of the Lustre debug log when timeouts occur. The default value
is 0 (zero).
/proc/sys/lustre/dump_on_eviction
This triggers dumps of the Lustre debug log when an eviction occurs. The default
value is 0 (zero). By default, debug logs are dumped to the /tmp folder; this location
can be changed via /proc.
If RPCs queued on the server approach their timeouts, then the server sends an early
reply to the client, telling the client to allow more time. In this manner, clients avoid
RPC timeouts and disconnect/reconnect cycles. Conversely, as a server speeds up,
RPC timeout values decrease, allowing faster detection of non-responsive servers and
faster attempts to reconnect to a server's failover partner.
Note – Nodes using multiple Lustre file systems must use the same at_* values for
all file systems.
Parameter Description
at_min Sets the minimum adaptive timeout (in seconds). Default value is 0.
The at_min parameter is the minimum processing time that a server
will report. Clients base their timeouts on this value, but they do not
use this value directly. If you experience cases in which, for unknown
reasons, the adaptive timeout value is too short and clients time out
their RPCs (usually due to temporary network outages), then you
can increase the at_min value to compensate for this. Ideally, users
should leave at_min set to its default.
at_max Sets the maximum adaptive timeout (in seconds). The at_max
parameter is an upper-limit on the service time estimate, and is used
as a 'failsafe' in case of rogue/bad/buggy code that would lead to
never-ending estimate increases. If at_max is reached, an RPC
request is considered 'broken' and should time out.
Setting at_max to 0 causes adaptive timeouts to be disabled and the
old fixed-timeout method (obd_timeout) to be used. This is the
default value in Lustre 1.6.5.
at_early_margin Sets how far before the deadline Lustre sends an early reply. Default
value is 5*.
at_extra Sets the incremental amount of time that a server asks for, with each
early reply. The server does not know how much time the RPC will
take, so it asks for a fixed value. Default value is 30†. When a server
finds a queued request about to time out (and needs to send an early
reply out), the server adds the at_extra value. If the time expires,
the Lustre client enters recovery status and reconnects to restore it to
normal status.
If you see multiple early replies for the same RPC asking for multiple
30-second increases, change the at_extra value to a larger number
to cut down on early replies sent and, therefore, network load.
ldlm_enqueue_min Sets the minimum lock enqueue time. Default value is 100. The
ldlm_enqueue time is the maximum of the measured enqueue
estimate (influenced by at_min and at_max parameters), multiplied
by a weighting factor, and the ldlm_enqueue_min setting. LDLM
lock enqueues were based on the obd_timeout value; now they
have a dedicated minimum value. Lock enqueues increase as the
measured enqueue times increase (similar to adaptive timeouts).
* This default was chosen as a reasonable time in which to send a reply from the point at which it was sent.
† This default was chosen as a balance between sending too many early replies for the same RPC and
overestimating the actual completion time.
Adaptive timeouts are enabled by default. To disable adaptive timeouts at run time,
set at_max to 0. On the MGS, run:
Note – Changing adaptive timeouts status at runtime may cause transient timeout,
reconnect, recovery, etc.
The output also provides a history of service times. In the example, there are 4 "bins"
of adaptive_timeout_history, with the maximum RPC time in each bin
reported. In 0-150 seconds, the maximum RPC time was 1, with the same result in
150-300 seconds. From 300-450 seconds, the worst (maximum) RPC time was 33
seconds, and from 450-600s the worst time was 2 seconds. The current estimated
service time is the maximum value of the 4 bins (33 seconds in this example).
Service times (as reported by the servers) are also tracked in the client OBDs:
cfs21:# lctl get_param osc.*.timeouts
last reply : 1193428639, 0d0h00m00s ago
network : cur 1 worst 2 (at 1193427053, 0d0h26m26s ago) 1 1 1 1
portal 6 : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33 2
portal 28 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 1 1 1
portal 7 : cur 1 worst 1 (at 1193426141, 0d0h41m38s ago) 1 0 1 1
portal 17 : cur 1 worst 1 (at 1193426177, 0d0h41m02s ago) 1 0 0 1
Server statistic files also show the range of estimates in the normal
min/max/sum/sumsq manner.
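As a rough illustration of how the four history bins relate to the current estimate, the sketch below parses one line of the client timeouts output shown above and checks that `cur` equals the largest bin. The regular expression and the `parse_timeout_line` helper are assumptions made for this example; they are not part of any Lustre tool.

```python
import re

# Matches lines such as:
#   portal 6 : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33 2
LINE_RE = re.compile(
    r"\s*(\w+(?: \d+)?)\s*:\s*cur\s+(\d+)\s+worst\s+(\d+)\s+\(.*?\)\s+(.*)"
)


def parse_timeout_line(line):
    """Return (name, cur, worst, bins) for one network/portal line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    name, cur, worst, bins = match.groups()
    return name, int(cur), int(worst), [int(b) for b in bins.split()]


if __name__ == "__main__":
    sample = "portal 6 : cur 33 worst 34 (at 1193427052, 0d0h26m27s ago) 33 33 33 2"
    name, cur, worst, bins = parse_timeout_line(sample)
    # The current estimate is the maximum over the history bins.
    print(name, cur, max(bins))
```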
/proc/sys/lnet/peers
Shows all NIDs known to this node and also gives information on the queue state.
# cat /proc/sys/lnet/peers
nid refs state max rtr min tx min queue
0@lo 1 ~rtr 0 0 0 0 0 0
192.168.10.35@tcp 1 ~rtr 8 8 8 8 6 0
192.168.10.36@tcp 1 ~rtr 8 8 8 8 6 0
192.168.10.37@tcp 1 ~rtr 8 8 8 8 6 0
Field Description
Credits work like a semaphore. At start they are initialized to allow a certain number
of operations (8 in this example). LNET keeps track of the minimum value so that
you can see how congested a resource was.
If rtr/tx is less than max, there are operations in progress. The number of
operations is equal to rtr or tx subtracted from max.
LNET also limits concurrent sends and router buffers allocated to a single peer so
that no peer can occupy all these resources.
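The credit arithmetic described above amounts to a subtraction. This minimal sketch uses values in the style of the peers listing (max 8, current tx 8, lowest tx seen 6); the function names are hypothetical.

```python
def operations_in_progress(max_credits, current_credits):
    """Operations currently in flight: credits taken out of the pool."""
    return max_credits - current_credits


def peak_operations(max_credits, min_credits_seen):
    """Largest number of simultaneous operations observed so far,
    derived from the low-water mark that LNET records."""
    return max_credits - min_credits_seen


if __name__ == "__main__":
    print(operations_in_progress(8, 8))  # nothing in flight right now
    print(peak_operations(8, 6))         # at most two operations at once
```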
# cat /proc/sys/lnet/nis
nid refs peer max tx min
0@lo 3 0 0 0 0
192.168.10.34@tcp 4 8 256 256 252
Shows the current queue health on this node. The fields are explained below:
Field Description
# cat /proc/sys/lnet/nis
nid refs peer max tx min
0@lo 2 0 0 0 0
10.67.73.173@tcp 4 8 256 256 253
$ cat /proc/fs/lustre/lov/<fsname>-mdtlov/qos_prio_free
Currently, the default is 90%. You can permanently set this value by running this
command on the MGS:
Setting the priority to 100% means that OSS distribution does not count in the
weighting, but the stripe assignment is still done via weighting. If OST 2 has twice as
much free space as OST 1, it is twice as likely to be used, but it is NOT guaranteed to
be used.
Also note that free-space stripe weighting does not activate until two OSTs are
imbalanced by more than 20%. Until then, a faster round-robin stripe allocator is
used. (The new round-robin order also maximizes network balancing.)
/proc/fs/lustre/llite/<fsname>-<uid>/max_cache_mb
This tunable is the maximum amount of inactive data cached by the client (default is
3/4 of RAM).
$ ls -d /proc/fs/lustre/osc/OSC_client_ost1_MNT_client_2 /localhost
/proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost2_MNT_localhost
/proc/fs/lustre/osc/OSC_uml0_ost3_MNT_localhost
$ ls /proc/fs/lustre/osc/OSC_uml0_ost1_MNT_localhost
blocksize filesfree max_dirty_mb ost_server_uuid stats
/proc/fs/lustre/osc/<object name>/max_dirty_mb
This tunable controls how many MBs of dirty data can be written and queued up in
the OSC. POSIX file writes that are cached contribute to this count. When the limit is
reached, additional writes stall until previously-cached writes are written to the
server. This may be changed by writing a single ASCII integer to the file. Only values
between 0 and 512 are allowable. If 0 is given, no writes are cached. Performance
suffers noticeably unless you use large writes (1 MB or more).
/proc/fs/lustre/osc/<object name>/cur_dirty_bytes
This tunable is a read-only value that returns the current number of bytes written
and cached on this OSC.
This tunable is the maximum number of pages that will undergo I/O in a single RPC
to the OST. The minimum is a single page and the maximum for this setting is
platform dependent (256 for i386/x86_64, possibly less for ia64/PPC with larger
PAGE_SIZE), though generally amounts to a total of 1 MB in the RPC.
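The relationship between page count and RPC size is simple arithmetic. In this sketch, the 4 KB page size for i386/x86_64 and the 64 KB page size used as a "larger PAGE_SIZE" example are assumptions chosen for illustration.

```python
def rpc_size_bytes(pages_per_rpc, page_size):
    """Total payload of one RPC, given the page count and the page size."""
    return pages_per_rpc * page_size


def pages_for_rpc_size(rpc_bytes, page_size):
    """Number of pages needed to fill an RPC of the given size."""
    return rpc_bytes // page_size


if __name__ == "__main__":
    one_mb = 1024 * 1024
    print(rpc_size_bytes(256, 4096))              # 256 pages of 4 KB
    print(pages_for_rpc_size(one_mb, 64 * 1024))  # fewer pages when pages are larger
```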
/proc/fs/lustre/osc/<object name>/max_rpcs_in_flight
This tunable is the maximum number of concurrent RPCs in flight from an OSC to its
OST. If the OSC tries to initiate an RPC but finds that it already has the same number
of RPCs outstanding, it will wait to issue further RPCs until some complete. The
minimum setting is 1 and maximum setting is 32. If you are looking to improve small
file I/O performance, increase the max_rpcs_in_flight value.
Note – The <object name> varies depending on the specific Lustre configuration.
For <object name> examples, refer to the sample command output.
# cat /proc/fs/lustre/osc/spfs-OST0000-osc-c45f9c00/rpc_stats
snapshot_time: 1174867307.156604 (secs.usecs)
read RPCs in flight: 0
write RPCs in flight: 0
pending write pages: 0
pending read pages: 0
read write
pages per rpc rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
rpcs in flight rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
read write
offset rpcs % cum % | rpcs % cum %
0: 0 0 0 | 0 0 0
Where:
Field Description
{read,write} RPCs in flight Number of read/write RPCs issued by the OSC, but not
complete at the time of the snapshot. This value should
always be less than or equal to max_rpcs_in_flight.
pending {read,write} pages Number of pending read/write pages that have been queued
for I/O in the OSC.
pages per RPC When an RPC is sent, the number of pages it consists of is
recorded (in order). A single page RPC increments the 0: row.
RPCs in flight When an RPC is sent, the number of other RPCs that are
pending is recorded. When the first RPC is sent, the 0: row is
incremented. If the first RPC is sent while another is pending,
the 1: row is incremented and so on. As each RPC
*completes*, the number of pending RPCs is not tabulated.
This table is a good way to visualize the concurrency of the
RPC stream. Ideally, you will see a large clump around the
max_rpcs_in_flight value, which shows that the network
is being kept busy.
offset
Read/write offset statistics are off, by default. The statistics can be activated by
writing anything into the offset_stats file.
Example:
# cat /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
snapshot_time: 1155748884.591028 (secs.usecs)
R/W PID RANGE START RANGE END SMALLEST EXTENT LARGEST EXTENT OFFSET
R 8385 0 128 128 128 0
R 8385 0 224 224 224 -128
W 8385 0 250 50 100 0
W 8385 100 1110 10 500 -150
W 8384 0 5233 5233 5233 0
R 8385 500 600 100 100 -610
Where:
Field Description
echo > /proc/fs/lustre/llite/lustre-f57dee00/rw_offset_stats
The rw_extent_stats histogram in the llite directory shows you the statistics for
the sizes of the read-write I/O extents. This file does not maintain the per-process
statistics.
Example:
$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats
snapshot_time: 1213828728.348516 (secs.usecs)
read | write
extents calls % cum% | calls % cum%
0K - 4K : 0 0 0 | 2 2 2
4K - 8K : 0 0 0 | 0 0 2
8K - 16K : 0 0 0 | 0 0 2
16K - 32K : 0 0 0 | 20 23 26
32K - 64K : 0 0 0 | 0 0 26
64K - 128K : 0 0 0 | 51 60 86
128K - 256K : 0 0 0 | 0 0 86
256K - 512K : 0 0 0 | 0 0 86
512K - 1024K : 0 0 0 | 0 0 86
1M - 2M : 0 0 0 | 11 13 100
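To show how the % and cum% columns relate to the calls column, the sketch below recomputes them from the write counts in the sample above. Truncating integer division is an assumption; it happens to reproduce the sample figures, but the actual rounding used by Lustre is not stated in this section.

```python
def percent_columns(calls):
    """Return (calls, percent, cumulative percent) for each histogram row."""
    total = sum(calls)
    rows = []
    running = 0
    for count in calls:
        running += count
        percent = 100 * count // total if total else 0
        cumulative = 100 * running // total if total else 0
        rows.append((count, percent, cumulative))
    return rows


if __name__ == "__main__":
    # Write calls per extent bucket, 0K-4K through 1M-2M, from the sample output
    write_calls = [2, 0, 0, 20, 0, 51, 0, 0, 0, 11]
    for row in percent_columns(write_calls):
        print(row)
```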
Example:
$ cat /proc/fs/lustre/llite/lustre-ee5af200/extents_stats_per_process
snapshot_time: 1213828762.204440 (secs.usecs)
read | write
extents calls % cum% | calls % cum%
PID: 11488
0K - 4K : 0 0 0 | 0 0 0
4K - 8K : 0 0 0 | 0 0 0
8K - 16K : 0 0 0 | 0 0 0
PID: 11491
0K - 4K : 0 0 0 | 0 0 0
4K - 8K : 0 0 0 | 0 0 0
8K - 16K : 0 0 0 | 0 0 0
16K - 32K : 0 0 0 | 20 100 100
PID: 11424
0K - 4K : 0 0 0 | 0 0 0
4K - 8K : 0 0 0 | 0 0 0
8K - 16K : 0 0 0 | 0 0 0
16K - 32K : 0 0 0 | 0 0 0
32K - 64K : 0 0 0 | 0 0 0
64K - 128K : 0 0 0 | 16 100 100
PID: 11426
0K - 4K : 0 0 0 | 1 100 100
PID: 11429
0K - 4K : 0 0 0 | 1 100 100
# cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats
snapshot_time: 1174875636.764630 (secs:usecs)
read write
pages per brw brws % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
discont pages rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
discont blocks rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
dio frags rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
disk ios in flight rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
io time (1/1000s) rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
disk io size rpcs % cum % | rpcs % cum %
1: 0 0 0 | 0 0 0
read write
Field Description
pages per brw Number of pages per RPC request, which should match aggregate client
rpc_stats.
discont pages Number of discontinuities in the logical file offset of each page in a
single RPC.
discont blocks Number of discontinuities in the physical block allocation in the file
system for a single RPC.
/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_mb
This tunable controls the maximum amount of data readahead on a file. Files are
read ahead in RPC-sized chunks (1 MB or the size of read() call, if larger) after the
second sequential read on a file descriptor. Random reads are done at the size of the
read() call only (no readahead). Reads to non-contiguous regions of the file reset the
readahead algorithm, and readahead is not triggered again until there are sequential
reads again. To disable readahead, set this tunable to 0. The default value is 40 MB.
/proc/fs/lustre/llite/<fsname>-<uid>/max_read_ahead_whole_mb
This tunable controls the maximum size of a file that is read in its entirety, regardless
of the size of the read().
/proc/fs/lustre/llite/*/statahead_max
This tunable controls whether directory statahead is enabled and the maximum
statahead count. By default, statahead is active.
To set the maximum statahead count (n), set this tunable to:
/proc/fs/lustre/llite/*/statahead_status
OSS read cache is enabled, by default, and managed by the following tunables:
■ read_cache_enable controls whether data read from disk during a read request
is kept in memory and available for later read requests for the same data, without
having to re-read it from disk. By default, read cache is enabled
(read_cache_enable = 1).
When the OSS receives a read request from a client, it reads data from disk into its
memory and sends the data as a reply to the requests. If read cache is enabled, this
data stays in memory after the client’s request is finished, and the OSS skips
reading data from disk when subsequent read requests for the same are received.
The read cache is managed by the Linux kernel globally across all OSTs on that
OSS, and the least recently used cache pages will be dropped from memory when
the amount of free memory is running low.
If read cache is disabled (read_cache_enable = 0), then the OSS will discard the
data after the client’s read requests are serviced and, for subsequent read requests,
the OSS must read the data from disk.
Note – Asynchronous journal commit cannot work with O_DIRECT writes; a journal
flush is still forced.
When asynchronous journal commit is enabled, client nodes keep data in the page
cache (a page reference). Lustre clients monitor the last committed transaction
number (transno) in messages sent from the OSS to the clients. When a client sees
that the last committed transno reported by the OSS is >=bulk write transno, it
releases the reference on the corresponding pages. To avoid page references being
held for too long on clients after a bulk write, a 7 second ping request is scheduled
(jbd commit time is 5 seconds) after the bulk write reply is received, so the OSS has
an opportunity to report the last committed transno.
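The page-reference rule above (release once the last committed transno reported by the OSS is greater than or equal to the bulk write transno) can be modeled in a few lines. This is a toy model under that single rule; the `ClientPageCache` class and its method names are hypothetical and do not mirror Lustre client code.

```python
class ClientPageCache:
    """Toy model of page references held until the OSS commits the journal."""

    def __init__(self):
        self.pinned = {}  # bulk write transno -> list of page ids still referenced

    def bulk_write_done(self, transno, pages):
        """Record the pages kept in cache after a bulk write reply."""
        self.pinned[transno] = list(pages)

    def on_server_message(self, last_committed):
        """Release pages whose transno is covered by the committed transno."""
        released = []
        for transno in sorted(t for t in self.pinned if t <= last_committed):
            released.extend(self.pinned.pop(transno))
        return released


if __name__ == "__main__":
    cache = ClientPageCache()
    cache.bulk_write_done(10, [1, 2])
    cache.bulk_write_done(12, [3])
    print(cache.on_server_message(9))   # nothing committed yet
    print(cache.on_server_message(10))  # pages of transno 10 are released
```

In this model the scheduled ping simply guarantees that `on_server_message` is called soon after a bulk write, even on an otherwise idle client.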
If the OSS crashes before the journal commit occurs, then the intermediate data is
lost. However, new OSS recovery functionality (introduced in the asynchronous
journal commit feature), causes clients to replay their write requests and compensate
for the missing disk updates by restoring the state of the file system.
When asynchronous journal commit is used, clients keep a page reference until the
journal transaction commits. This can cause problems when a client receives a
blocking callback, because pages need to be removed from the page cache, but they
cannot be removed because of the extra page reference.
This problem is solved by forcing a journal flush on lock cancellation. When this
happens, the client is granted the metadata blocks that have hit the disk, and it can
safely release the page reference before processing the blocking callback. The
parameter which controls this action is sync_on_lock_cancel, which can be set to
the following values:
always: Always force a journal flush on lock cancellation
blocking: Force a journal flush only when the local cancellation is due to a
blocking callback
never: Do not force any journal flush
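The three sync_on_lock_cancel values can be read as a small decision rule. The function below is a sketch of that rule only; its name and arguments are hypothetical.

```python
def should_flush_journal(mode, due_to_blocking_callback):
    """Decide whether a lock cancellation forces a journal flush.

    mode is one of the sync_on_lock_cancel values described above.
    """
    if mode == "always":
        return True
    if mode == "blocking":
        return due_to_blocking_callback
    if mode == "never":
        return False
    raise ValueError(f"unknown sync_on_lock_cancel value: {mode!r}")


if __name__ == "__main__":
    for mode in ("always", "blocking", "never"):
        print(mode, should_flush_journal(mode, True), should_flush_journal(mode, False))
```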
Parameter Description
Also, number-of-blocks-in-request (third number in the goal triple) can tell the
number of blocks requested by the obdfilter. If the obdfilter is doing a lot of small
requests (just a few blocks), then either the client is processing input/output to a lot of
small files, or something may be wrong with the client (because it is better if the client
sends large input/output requests). This can be investigated with the OSC
rpc_stats or OST brw_stats mentioned above.
Number of groups scanned (grps column) should be small. If it reaches a few dozen
often, then either your disk file system is pretty fragmented or mballoc is doing
something wrong in the group selection part.
Field Description
The following tunables, providing more control over allocation policy, will be
available in the next version:
Field Description
31.2.11 Locking
/proc/fs/lustre/ldlm/ldlm/namespaces/<OSC name|MDCname>/lru_size
The total number of locks available is a function of the server’s RAM. The default
limit is 50 locks/1 MB of RAM. If there is too much memory pressure, then the LRU
size is shrunk. The number of locks on the server is limited to {number of OST/MDT
on node} * {number of clients} * {client lru_size}.
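The two limits in the paragraph above can be worked through with sample numbers. The RAM size, target count, client count and lru_size used here are hypothetical values chosen only to illustrate the formulas.

```python
def default_lock_limit(ram_mb, locks_per_mb=50):
    """Default limit: 50 locks for every 1 MB of server RAM."""
    return ram_mb * locks_per_mb


def server_lock_bound(targets_on_node, clients, client_lru_size):
    """Upper bound on server locks:
    {OSTs/MDTs on node} * {clients} * {client lru_size}."""
    return targets_on_node * clients * client_lru_size


if __name__ == "__main__":
    print(default_lock_limit(16 * 1024))    # hypothetical server with 16 GB of RAM
    print(server_lock_bound(4, 1000, 100))  # 4 targets, 1000 clients, lru_size 100
```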
■ To enable automatic LRU sizing, set the lru_size parameter to 0. In this case, the
lru_size parameter shows the current number of locks being used on the export.
(In Lustre 1.6.5.1 and later, LRU sizing is enabled, by default.)
■ To specify a maximum number of locks, set the lru_size parameter to a value >
0 (former numbers are okay, 100 * CPU_NR). We recommend that you only
increase the LRU size on a few login nodes where users access the file system
interactively.
To clear the LRU on a single client, and as a result flush client cache, without
changing the lru_size value:
If you shrink the LRU size below the number of existing unused locks, then the
unused locks are canceled immediately. Use echo clear to cancel all locks without
changing the value.
Service Description
ost.OSS.ost_io.threads_started=128
■ To set the maximum number of threads (512), run:
ost.OSS.ost_io.threads_max=512
■ To set the maximum thread count to 256 instead of 512 (to avoid overloading the
storage or for an array with requests), run:
ost.OSS.ost_io.threads_max=256
■ To check if the new threads_max setting is active, run:
ost.OSS.ost_io.threads_max=256
Note – Currently, the maximum thread count setting is advisory because Lustre
does not reduce the number of service threads in use, even if that number exceeds
the threads_max value. Lustre does not stop service threads once they are started.
By default, Lustre generates a detailed log of all operations to aid in debugging. The
level of debugging can affect the performance or speed you achieve with Lustre.
Therefore, it is useful to reduce this overhead by turning down the debug level¹ to
improve performance. Raise the debug level when you need to collect the logs for
debugging problems. The debugging mask can be set with "symbolic names" instead
of the numerical values that were used in prior releases. The new symbolic format is
shown in the examples below.
Note – All of the commands below must be run as root; note the # nomenclature.
To verify the debug level used by examining the sysctl that controls debugging, run:
# sysctl lnet.debug
lnet.debug = ioctl neterror warning error emerg ha config console
To turn off debugging (except for network error debugging), run this command on
all concerned nodes:
# sysctl -w lnet.debug="neterror"
lnet.debug = neterror
To turn off debugging completely, run this command on all concerned nodes:
# sysctl -w lnet.debug=0
lnet.debug = 0
The flags above collect enough high-level information to aid debugging, but they do
not cause any serious performance impact.
# sysctl -w lnet.debug="warning"
lnet.debug = warning
1. This controls the level of Lustre debugging kept in the internal log buffer. It does not alter the level of
debugging that goes to syslog.
# sysctl lnet.debug
lnet.debug = neterror warning ha
# sysctl -w lnet.debug="-ha"
lnet.debug = -ha
# sysctl lnet.debug
lnet.debug = neterror warning
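The effect of a symbolic setting such as "-ha" on the current mask can be modeled as follows. This is a sketch of the observable behavior in the transcript above, not of the kernel code. The "+flag" form for adding a flag is an assumption made for symmetry, and the `apply_debug_setting` helper is hypothetical.

```python
def apply_debug_setting(current, setting):
    """Return the debug mask that results from writing `setting`.

    Tokens prefixed with '-' remove a flag, tokens prefixed with '+'
    (assumed) add one; a setting without prefixes replaces the mask.
    """
    flags = current.split()
    tokens = setting.split()
    if tokens and all(token[0] in "+-" for token in tokens):
        for token in tokens:
            name = token[1:]
            if token[0] == "+":
                if name not in flags:
                    flags.append(name)
            else:
                flags = [flag for flag in flags if flag != name]
        return " ".join(flags)
    return " ".join(tokens)


if __name__ == "__main__":
    print(apply_debug_setting("neterror warning ha", "-ha"))
```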
You can verify and change the debug level using the /proc interface in Lustre. To use
the flags with /proc, run:
# cat /proc/sys/lnet/debug
neterror warning
# cat /proc/sys/lnet/debug
neterror warning ha
# cat /proc/sys/lnet/debug
neterror ha
/proc/sys/lnet/subsystem_debug
This controls the debug logs3 for subsystems (see S_* definitions).
/proc/sys/lnet/debug_path
This indicates the location where debugging symbols should be stored for gdb. The
default is set to /r/tmp/lustre-log-localhost.localdomain.
Note – The above entries only exist when Lustre has already been loaded.
/proc/sys/lnet/panic_on_lbug
This causes Lustre to call "panic" when it detects an internal problem (an LBUG);
panic crashes the node. This is particularly useful when a kernel crash dump utility
is configured. The crash dump is triggered when the internal inconsistency is
detected by Lustre.
This allows you to specify the path to the binary which will be invoked when an
LBUG is encountered. This binary is called with four parameters. The first one is the
string "LBUG". The second one is the file where the LBUG occurred. The third one is
the function name. The fourth one is the line number in the file.
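A program invoked this way could unpack its four parameters as sketched below. This is a hypothetical handler written for illustration; only the order of the four parameters comes from the text, and the file and function names in the usage comment are made up.

```python
import sys


def describe_lbug(argv):
    """Format the four upcall parameters: 'LBUG', file, function, line."""
    if len(argv) != 5:
        raise ValueError("expected: <program> LBUG <file> <function> <line>")
    _, tag, source_file, function, line = argv
    return f"{tag} in {function}() at {source_file}:{int(line)}"


if __name__ == "__main__" and len(sys.argv) == 5:
    # Example invocation (hypothetical values):
    #   handler.py LBUG some_file.c some_function 123
    print(describe_lbug(sys.argv))
```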
Note – See also Section 36.6, “llobdstat” on page 36-14 and Section 12.3, “CollectL”
on page 12-8.
The OST .../stats files can be used to track client statistics (client activity) for
each OST. It is possible to get a periodic dump of values from these files (for example,
every 10 seconds) that shows the RPC rates (similar to iostat) by using the
llstat.pl tool:
# llstat /proc/fs/lustre/osc/lustre-OST0000-osc/stats
To clear the statistics, give the -c option to llstat.pl. To specify how frequently
the statistics should be cleared (in seconds), use an integer for the -i option. This is
sample output with the -c and -i10 options used (providing statistics every 10s):
$ llstat -c -i10 /proc/fs/lustre/ost/OSS/ost_io/stats
/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074103.284895
Name         Cur.Count Cur.Rate #Events Unit    last     min   avg       max    stddev
req_waittime 8         0        8       [usec]  2078     34    259.75    868    317.49
req_qdepth   8         0        8       [reqs]  1        0     0.12      1      0.35
req_active   8         0        8       [reqs]  11       1     1.38      2      0.52
reqbuf_avail 8         0        8       [bufs]  511      63    63.88     64     0.35
ost_write    8         0        8       [bytes] 1697677  72914 212209.62 387579 91874.29
/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074113.290180
Name         Cur.Count Cur.Rate #Events Unit    last     min   avg       max    stddev
req_waittime 31        3        39      [usec]  30011    34    822.79    12245  2047.71
req_qdepth   31        3        39      [reqs]  0        0     0.03      1      0.16
req_active   31        3        39      [reqs]  58       1     1.77      3      0.74
reqbuf_avail 31        3        39      [bufs]  1977     63    63.79     64     0.41
ost_write    30        3        38      [bytes] 10284679 15019 315325.16 910694 197776.51
/proc/fs/lustre/ost/OSS/ost_io/stats @ 1181074123.325560
Name         Cur.Count Cur.Rate #Events Unit    last     min   avg       max    stddev
req_waittime 21        2        60      [usec]  14970    34    784.32    12245  1878.66
Where:
Parameter Description
Cur. Count Number of events of each type sent in the last interval (in this example,
10s)
Cur. Rate Number of events per second in the last interval
#Events Total number of such events since the system started
Unit Unit of measurement for that statistic (microseconds, requests, buffers)
last Average rate of these events (in units/event) for the last interval during
which they arrived. For instance, in the above mentioned case of
ost_destroy it took an average of 736 microseconds per destroy for the 400
object destroys in the previous 10 seconds.
min Minimum rate (in units/events) since the service started
avg Average rate
max Maximum rate
stddev Standard deviation (not measured in all cases)
Parameter Description
req_waittime Amount of time a request waited in the queue before being handled by an
available server thread.
req_qdepth Number of requests waiting to be handled in the queue for this service.
req_active Number of requests currently being handled.
reqbuf_avail Number of unsolicited lnet request buffers for this service.
Parameter Description
ldlm_enqueue Time it takes to enqueue a lock (this includes file open on the MDS)
mds_reint Time it takes to process an MDS modification record (includes create,
mkdir, unlink, rename and setattr)
Note – See also Section 36.6, “llobdstat” on page 36-14 and Section 12.3, “CollectL”
on page 12-8.
The MDT .../stats files can be used to track MDT statistics for the MDS. Here is
sample output for an MDT stats file:
# cat /proc/fs/lustre/mds/*-MDT0000/stats
snapshot_time 1244832003.676892 secs.usecs
open 2 samples [reqs]
close 1 samples [reqs]
getxattr 3 samples [reqs]
process_config 1 samples [reqs]
connect 2 samples [reqs]
disconnect 2 samples [reqs]
statfs 3 samples [reqs]
setattr 1 samples [reqs]
getattr 3 samples [reqs]
llog_init 6 samples [reqs]
notify 16 samples [reqs]
User Utilities
This chapter describes user utilities and includes the following sections:
■ lfs
■ lfs_migrate
■ lfsck
■ Filefrag
■ Mount
■ Handling Timeouts
Note – User utility man pages are found in the lustre/doc folder.
32.1 lfs
The lfs utility can be used for user configuration routines and monitoring.
Synopsis
lfs
lfs changelog [--follow] <mdtname> [startrec [endrec]]
lfs changelog_clear <mdtname> <id> <endrec>
lfs check <mds|osts|servers>
lfs df [-i] [-h] [--pool|-p <fsname>[.<pool>]] [path]
lfs find [[!] --atime|-A [-+]N] [[!] --mtime|-M [-+]N]
[[!] --ctime|-C [-+]N] [--maxdepth|-D N] [--name|-n <pattern>]
[--print|-p] [--print0|-P] [[!] --obd|-O <uuid[s]>]
[[!] --size|-S [+-]N[kMGTPE]] [--type|-t {bcdflpsD}]
[[!] --gid|-g|--group|-G <gname>|<gid>]
[[!] --uid|-u|--user|-U <uname>|<uid>]
<dirname|filename>
lfs osts [path]
lfs getstripe [--obd|-O <uuid>] [--quiet|-q] [--verbose|-v]
[--count|-c] [--index|-i | --offset|-o]
[--size|-s] [--pool|-p] [--directory|-d]
[--recursive|-r] <dirname|filename> ...
lfs setstripe [--size|-s stripe_size] [--count|-c stripe_cnt]
[--index|-i|--offset|-o start_ost_index]
[--pool|-p <pool>]
<dirname|filename>
lfs setstripe -d <dir>
lfs poollist <filesystem>[.<pool>]|<pathname>
lfs quota [-q] [-v] [-o obd_uuid|-I ost_idx|-i mdt_idx] [-u <uname>|
-u <uid>|-g <gname>| -g <gid>] <filesystem>
lfs quota -t <-u|-g> <filesystem>
lfs quotacheck [-ug] <filesystem>
lfs quotachown [-i] <filesystem>
lfs quotaon [-ugf] <filesystem>
lfs quotaoff [-ug] <filesystem>
lfs quotainv [-ug] [-f] <filesystem>
Note – In the above example, the <filesystem> parameter refers to the mount
point of the Lustre file system. The default mount point is /mnt/lustre.
Note – The old lfs quota output was very detailed and contained cluster-wide
quota statistics (including cluster-wide limits for a user/group and cluster-wide
usage for a user/group), as well as statistics for each MDS/OST. Now, lfs quota
has been updated to provide only cluster-wide statistics, by default. To obtain the full
report of cluster-wide limits, usage and statistics, use the -v option with lfs quota.
Description
The lfs utility is used to create a new file with a specific striping pattern, determine
the default striping pattern, gather the extended attributes (object numbers and
location) for a specific file, find files with specific attributes, list OST information, or
set quota limits. It can be invoked interactively without any arguments or in a
non-interactive mode with one of the supported arguments.
Option Description
changelog
Shows the metadata changes on an MDT. Start and end points
are optional. The --follow option blocks on new changes; this
option is only valid when run directly on the MDT node.
changelog_clear
Indicates that changelog records previous to <endrec> are no
longer of interest to a particular consumer <id>, potentially
allowing the MDT to free up disk space. An <endrec> of 0
indicates the current last record. Changelog consumers must be
registered on the MDT node using lctl.
check
Displays the status of the MDS or OSTs (as specified in the
command) or all servers (MDS and OSTs).
df [-i] [-h] [--pool|-p <fsname>[.<pool>]] [path]
Report file system disk space usage or inode usage (with -i) of
each MDT/OST or a subset of OSTs if a pool is specified with
-p. By default, prints the usage of all mounted Lustre file
systems. Otherwise, if path is specified, prints only the usage of
that file system. If -h is given, the output is printed in
human-readable format, using SI base-2 suffixes for Mega-,
Giga-, Tera-, Peta-, or Exabytes.
find
Searches the directory tree rooted at the given
directory/filename for files that match the given parameters.
--atime
File was last accessed N*24 hours ago. (There is no guarantee
that atime is kept coherent across the cluster.)
--print / --print0
Prints the full filename, followed by a new line or NULL
character correspondingly.
osts [path]
Lists all OSTs for the file system. If a path located on a
Lustre-mounted file system is specified, then only OSTs
belonging to this file system are displayed.
getstripe
Lists striping information for a given filename or directory. By
default, the stripe count, size and offset are returned.
--recursive
Recurses into all sub-directories.
setstripe
Creates a new file or sets the directory default with specific
striping parameters.†
--count stripe_cnt
Number of OSTs over which to stripe a file. A stripe_cnt of 0
uses the file system-wide default stripe count (default is 1). A
stripe_cnt of -1 stripes over all available OSTs, and normally
results in a file with 80 stripes.
--size stripe_size*
Number of bytes to store on an OST before moving to the next
OST. A stripe_size of 0 uses the file system’s default stripe size,
(default is 1 MB). Can be specified with k (KB), m (MB), or g
(GB), respectively.
--index --offset start_ost_index
The OST index (base 10, starting at 0) on which to start striping
for this file. A start_ost_index value of -1 allows the MDS to
choose the starting index. This is the default value, and it means
that the MDS selects the starting OST as it wants. We strongly
recommend selecting this default, as it allows space and load
balancing to be done by the MDS as needed. The
start_ost_index value has no bearing on whether the MDS
will use round-robin or QoS weighted allocation for the
remaining stripes in the file.
--pool <pool>
Name of the pre-defined pool of OSTs (see Section 36.3, “lctl”
on page 36-4) that will be used for striping. The stripe_cnt,
stripe_size and start_ost_index values are used as well. The
start_ost_index value must be part of the pool or an error is returned.
setstripe -d
Deletes default striping on the specified directory.
poollist {filesystem} [.poolname]|{pathname}
Lists pools in the file system or pathname or OSTs in file
system.pool.
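The --count and --size options together describe a round-robin (RAID 0) layout: stripe_size bytes go to one object before data moves on to the next. The following is a minimal sketch of that mapping, assuming the standard RAID 0 arithmetic. The names stripe_loc and locate_stripe are illustrative only and are not part of lfs or the llapi library, and the stripe index is relative to the file's first object, not an OST index.

```c
#include <stdint.h>

/* Position of a byte within a RAID 0 striped file (illustrative type). */
struct stripe_loc {
    uint32_t stripe_index;   /* which of the stripe_count objects holds the byte */
    uint64_t object_offset;  /* byte offset inside that object */
};

/* Caller must pass stripe_size > 0 and stripe_count > 0. */
struct stripe_loc locate_stripe(uint64_t file_offset, uint64_t stripe_size,
                                uint32_t stripe_count)
{
    struct stripe_loc loc;
    uint64_t chunk = file_offset / stripe_size;   /* index of the stripe-sized chunk */

    loc.stripe_index = (uint32_t)(chunk % stripe_count);
    loc.object_offset = (chunk / stripe_count) * stripe_size
                        + file_offset % stripe_size;
    return loc;
}
```

For a file striped over two OSTs with 128 KB stripes, locate_stripe(131072, 131072, 2) places the byte at the start of the second object.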
Examples
Creates a file striped on two OSTs with 128 KB on each stripe.
Deletes a default stripe pattern on a given directory. New files use the default
striping pattern.
Recursively lists all regular files in a given directory more than 30 days old.
Recursively lists all files in a given directory that have objects on OST2-UUID.
The lfs check servers command checks the status of all servers (MDT and OSTs).
$ lfs osts
$ lfs df -h
$ lfs df -i
Checks quotas for user and group. Turns on quotas after making the check.
Sets quotas of user ‘bob’, with a 1 GB block quota hardlimit and a 2 GB block quota
softlimit.
Sets grace times for user quotas: 1000 seconds for block quotas, 1 week and 4 days for
inode quotas.
Lists the pools defined for the mounted Lustre file system /mnt/lustre
Lists the OSTs which are members of the pool my_pool in file system my_fs
Associates a directory with the pool my_pool, so all new files and directories are
created in the pool.
See Also
lctl in Section 36.3, “lctl” on page 36-4
Synopsis
lfs_migrate [-c|-s] [-h] [-l] [-n] [-q] [-y] [file|directory ...]
Description
The lfs_migrate utility is a simple tool to assist migration of files between Lustre
OSTs. The utility copies each specified file to a new file, verifies the file contents have
not changed, and then renames the new file to the original filename. This allows
space usage to be balanced between OSTs, or files to be moved off OSTs that are
starting to show hardware problems (though are still functional) or that will be
discontinued.
Because lfs_migrate is not closely integrated with the MDS, it cannot determine
whether a file is currently open and/or in-use by other applications or nodes. This
makes it UNSAFE for use on files that might be modified by other applications, since
the migrated file is only a copy of the current file. This results in the old file
becoming an open-unlinked file and any modifications to that file are lost.
The current file allocation policies on the MDS dictate where the new files are placed,
taking into account whether specific OSTs have been disabled on the MDS via lctl
(preventing new files from being allocated there), whether some OSTs are overly full
(reducing the number of files placed on those OSTs), or if there is a specific default
file striping for the target directory (potentially changing the stripe count, stripe size,
OST pool, or OST index of a new file).
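The copy, verify and rename sequence described above can be sketched in plain C. This is an illustration of the three steps only, not how lfs_migrate itself is implemented. The function names and the ".migrate_tmp" suffix are hypothetical, and the sketch shares the limitation noted above: it cannot tell whether another process has the file open.

```c
#include <stdio.h>

/* Copy src to dst. Returns 0 on success, -1 on error. */
static int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    size_t n;
    int rc = 0;
    FILE *in = fopen(src, "rb");
    FILE *out;

    if (in == NULL)
        return -1;
    out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    fclose(in);
    if (fclose(out) != 0)
        rc = -1;
    return rc;
}

/* Returns 1 if both files hold the same bytes, 0 if not, -1 on open error. */
static int files_identical(const char *a, const char *b)
{
    int ca, cb;
    FILE *fa = fopen(a, "rb");
    FILE *fb;

    if (fa == NULL)
        return -1;
    fb = fopen(b, "rb");
    if (fb == NULL) {
        fclose(fa);
        return -1;
    }
    do {
        ca = fgetc(fa);
        cb = fgetc(fb);
    } while (ca == cb && ca != EOF);
    fclose(fa);
    fclose(fb);
    return ca == cb;
}

/* Copy path to a temporary file, compare, then rename over the original. */
int migrate_sketch(const char *path)
{
    char tmp[4096];
    int len = snprintf(tmp, sizeof(tmp), "%s.migrate_tmp", path);

    if (len < 0 || len >= (int)sizeof(tmp))
        return -1;
    if (copy_file(path, tmp) != 0 || files_identical(path, tmp) != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

On a Lustre client, the new copy would be placed according to the MDS allocation policies described above, which is what moves the data to different OSTs.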
Option Description
-c
Compares file data after migrate (default value, use -s to disable).
-s
Skips file data comparison after migrate (use -c to enable).
-h
Displays help information.
-l
Migrates files with hard links (skips, by default). Files with multiple hard links are
split into multiple separate files by lfs_migrate, so they are skipped, by default,
to avoid breaking the hard links.
-n
Only prints the names of files to be migrated.
-q
Runs quietly (does not print filenames or status).
-y
Answers 'y' to usage warning without prompting (for scripts).
Examples
$ lfs_migrate /mnt/lustre/file
Rebalances all files in /mnt/lustre/dir.
See Also
lfs in Section 32.1, “lfs” on page 32-2
The e2fsck utility is run on each of the local MDS and OST device file systems and
verifies that the underlying ldiskfs is consistent. After e2fsck is run, lfsck does
distributed coherency checking for the Lustre file system. In most cases, e2fsck is
sufficient to repair any file system issues and lfsck is not required.
Synopsis
lfsck [-c|--create] [-d|--delete] [-f|--force] [-h|--help]
[-l|--lostfound] [-n|--nofix] [-v|--verbose] --mdsdb
mds_database_file --ostdb ost1_database_file [ost2_database_file...]
<filesystem>
Note – As shown, the <filesystem> parameter refers to the Lustre file system mount
point. The default mount point is /mnt/lustre.
Option Description
-c
Creates (empty) missing OST objects referenced by MDS inodes.
-d
Deletes orphaned objects from the file system. Since objects on the OST are often
only one of several stripes of a file, it can be difficult to compile multiple objects
together in a single, usable file.
-h
Prints a brief help message.
-l
Puts orphaned objects into a lost+found directory in the root of the file system.
-n
Performs a read-only check; does not repair the file system.
-v
Enables verbose operation; specify the option multiple times for more verbosity.
--mdsdb mds_database_file
MDS database file created by running e2fsck --mdsdb mds_database_file <device>
on the MDS backing device. This is required.
--ostdb ost1_database_file [ost2_database_file...]
OST database files created by running e2fsck --ostdb ost_database_file <device> on
each of the OST backing devices. These are required unless an OST is unavailable, in
which case all objects thereon are considered missing.
Description
The lfsck utility is used to check and repair the distributed coherency of a Lustre file
system. If an MDS or an OST becomes corrupt, run a distributed check on the file
system to determine what sort of problems exist. Use lfsck to correct any defects
found.
For more information on using e2fsck and lfsck, including examples, see
Section 30.5, “Commit on Share” on page 30-15. For information on resolving
orphaned objects, see Section 27.2.1, “Working with Orphaned Objects” on page 27-8.
Synopsis
filefrag [ -belsv ] [ files... ]
Description
The filefrag utility reports the extent of fragmentation in a given file. Initially, filefrag
attempts to obtain extent information using FIEMAP ioctl, which is efficient and fast.
If FIEMAP is not supported, then filefrag uses FIBMAP.
Note – Lustre only supports FIEMAP ioctl. FIBMAP ioctl is not supported.
Option Description
-b
Uses the 1024-byte blocksize for the output. By default, this blocksize is used by
Lustre, since OSTs may use different block sizes.
-e
Uses the extent mode when printing the output.
-l
Displays extents in LUN offset order.
-s
Synchronizes the file before requesting the mapping.
-v
Uses the verbose mode when checking file fragmentation.
Examples
Lists default output.
$ filefrag /mnt/lustre/foo
/mnt/lustre/foo: 6 extents found
When a client performs any remote operation, it gives the server a reasonable
amount of time to respond. If a server does not reply either due to a down network,
hung server, or any other reason, a timeout occurs which requires a recovery.
If a timeout occurs, a message (similar to this one), appears on the console of the
client, and in /var/log/messages:
LustreError: 26597:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout
RPC:/0/0 rc 0
Note – Lustre programming interface man pages are found in the lustre/doc
folder.
33.1 User/Group Cache Upcall
This section describes user and group upcall.
33.1.1 Name
Use /proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_upcall to look
up a given user’s group membership.
33.1.2 Description
The group upcall file contains the path to an executable that, when installed, is
invoked to resolve a numeric UID to a group membership list. This utility should
complete the mds_grp_downcall_data data structure (see Section 33.1.4, “Data
Structures” on page 33-4) and write it to the
/proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_info pseudo-file.
/*
 * permission file format is like this:
 * {nid} {uid} {perms}
 *
 * '*' nid means any nid
 * '*' uid means any uid
 * the valid values for perms are:
 * setuid/setgid/setgrp/rmtacl -- enable corresponding perm
 * nosetuid/nosetgid/nosetgrp/normtacl -- disable corresponding perm
 * they can be listed together, separated by ',',
 * when perm and noperm are in the same line (item), noperm is
 * preferential,
 * when they are in different lines (items), the latter is
 * preferential,
 * '*' nid is as default perm, and is not preferential.
 */
Currently, rmtacl/normtacl can be ignored; they are part of the security
functionality and are used for remote clients. The /usr/sbin/l_getidentity utility
can parse /etc/lustre/perm.conf to obtain the permission mask for a specified UID.
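The rule that noperm takes precedence over perm within a single line can be illustrated with a small parser for the {perms} field. This is only a sketch: the PERM_* bit values and the function names are hypothetical, and it does not reproduce how l_getidentity parses the file or how lines for different NIDs and UIDs are combined.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative bit values; the real mask values are defined by Lustre. */
enum {
    PERM_SETUID = 1 << 0,
    PERM_SETGID = 1 << 1,
    PERM_SETGRP = 1 << 2,
    PERM_RMTACL = 1 << 3
};

/* Map a permission name of the given length to its bit, or 0 if unknown. */
static unsigned perm_bit(const char *name, size_t len)
{
    static const struct { const char *name; unsigned bit; } table[] = {
        { "setuid", PERM_SETUID }, { "setgid", PERM_SETGID },
        { "setgrp", PERM_SETGRP }, { "rmtacl", PERM_RMTACL },
    };
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (strlen(table[i].name) == len &&
            strncmp(table[i].name, name, len) == 0)
            return table[i].bit;
    return 0;
}

/*
 * Parse one {perms} field such as "setuid,setgid,nosetuid".
 * Returns 0 and stores the resulting mask, or -1 on an unknown keyword.
 */
int parse_perms(const char *perms, unsigned *mask)
{
    unsigned allow = 0, deny = 0, bit;
    const char *p = perms;

    while (*p != '\0') {
        size_t len = strcspn(p, ",");   /* length of the current keyword */

        if (len > 2 && strncmp(p, "no", 2) == 0 &&
            (bit = perm_bit(p + 2, len - 2)) != 0)
            deny |= bit;
        else if ((bit = perm_bit(p, len)) != 0)
            allow |= bit;
        else
            return -1;
        p += len;
        if (*p == ',')
            p++;
    }
    *mask = allow & ~deny;   /* noperm wins over perm within one line */
    return 0;
}
```

With this sketch, "setuid,setgid,nosetuid" yields a mask with only the setgid bit set, regardless of the order in which the keywords appear.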
■ To avoid repeated upcalls, the MDS caches supplemental group information. Use
/proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_expire to set the
cache time (default is 600 seconds). The kernel waits for the upcall to complete (at
most, 5 seconds) and takes the "failure" behavior as described. Set the wait time in
/proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_acquire_expire
(default is 15 seconds). Cached entries are flushed by writing to
/proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_flush.
33.1.3 Parameters
■ Name of the MDS service
■ Numeric UID
Synopsis
l_getgroups [-v] [-d|mdsname] uid
l_getgroups [-v] -s
Description
The group upcall file contains the path to an executable that, when properly
installed, is invoked to resolve a numeric UID to a group membership list. This
utility should complete the mds_grp_downcall_data data structure (see Data
structures) and write it to the /proc/fs/lustre/mds/mds-service/group_info
pseudo-file.
Files
/proc/fs/lustre/mds/mds-service/group_upcall
This chapter describes the llapi library of commands used for setting Lustre file
properties within a C program running in a cluster environment, such as a data
processing or MPI application. The commands described in this chapter are:
■ llapi_file_create
■ llapi_file_get_stripe
■ llapi_file_open
■ llapi_quotactl
■ llapi_path2fid
■ Example Using the llapi Library
Note – Lustre programming interface man pages are found in the lustre/doc
folder.
34.1 llapi_file_create
Use llapi_file_create to set Lustre properties for a new file.
Synopsis
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
Description
The llapi_file_create() function sets a file descriptor’s Lustre striping
information. The file descriptor is then accessed with open().
Option Description
llapi_file_create()
If the file already exists, the function returns EEXIST.
If the stripe parameters are invalid, the function returns EINVAL.
stripe_size
This value must be an even multiple of the system page size, as shown by
getpagesize(). The default Lustre stripe size is 4MB.
stripe_offset
Indicates the starting OST for this file.
stripe_count
Indicates the number of OSTs that this file will be striped across.
stripe_pattern
Indicates the RAID pattern.
Note – Currently, only RAID 0 is supported. To use the system defaults, set these
values: stripe_size = 0, stripe_offset = -1, stripe_count = 0, stripe_pattern = 0
int stripe_offset = -1;
int stripe_pattern = 0;
int rc, fd;

rc = llapi_file_create(tfile, stripe_size, stripe_offset,
                       stripe_count, stripe_pattern);
The result code is negated: on failure, the call may return -EINVAL or an ioctl error.
if (rc) {
        fprintf(stderr, "llapi_file_create failed: %d (%s)\n", rc,
                strerror(-rc));
        return -1;
}
llapi_file_create closes the file descriptor. You must re-open the descriptor. To
do this, run:
Synopsis
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <liblustre.h>
#include <lustre/lustre_idl.h>
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
Description
The llapi_file_get_stripe() function returns striping information for a file or
directory path in lum (which should point to a large enough memory region) in one of
the following formats:
struct lov_user_md_v1 {
__u32 lmm_magic;
__u32 lmm_pattern;
__u64 lmm_object_id;
__u64 lmm_object_seq;
__u32 lmm_stripe_size;
__u16 lmm_stripe_count;
__u16 lmm_stripe_offset;
struct lov_user_ost_data_v1 lmm_objects[0];
} __attribute__((packed));
Option Description
lmm_magic
Specifies the format of the returned striping information. LOV_MAGIC_V1 is
used for lov_user_md_v1. LOV_MAGIC_V3 is used for lov_user_md_v3.
lmm_pattern
Holds the striping pattern. Only LOV_PATTERN_RAID0 is possible in this Lustre
version.
lmm_object_id
Holds the MDS object ID.
lmm_object_seq
Holds the MDS object group.
lmm_stripe_size
Holds the stripe size in bytes.
lmm_stripe_count
Holds the number of OSTs over which the file is striped.
lmm_stripe_offset
Holds the OST index from which the file starts.
lmm_pool_name
Holds the OST pool name to which the file belongs (lov_user_md_v3 only).
lmm_objects
An array of lmm_stripe_count members containing per OST file information in
the following format:
struct lov_user_ost_data_v1 {
__u64 l_object_id;
__u64 l_object_seq;
__u32 l_ost_gen;
__u32 l_ost_idx;
} __attribute__((packed));
l_object_id
Holds the OST’s object ID.
l_object_seq
Holds the OST’s object group.
l_ost_gen
Holds the OST’s index generation.
l_ost_idx
Holds the OST’s index in LOV.
Return Values
llapi_file_get_stripe() returns:
0 On success
Errors Description
Examples
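#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/vfs.h>
#include <liblustre.h>
#include <lnet/lnetctl.h>
#include <obd.h>
#include <lustre_lib.h>
#include <lustre/liblustreapi.h>
#include <obd_lov.h>

/* Allocate a buffer large enough for either striping format (v1 or v3). */
static void *alloc_lum(void)
{
        size_t v1, v3;

        v1 = sizeof(struct lov_user_md_v1) +
             LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);
        v3 = sizeof(struct lov_user_md_v3) +
             LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data_v1);
        return malloc(v1 > v3 ? v1 : v3);
}

int main(int argc, char **argv)
{
        struct lov_user_md *lum_file = NULL;
        int rc;

        if (argc != 2) {
                fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
                return 1;
        }
        lum_file = alloc_lum();
        if (lum_file == NULL) {
                rc = ENOMEM;
                goto cleanup;
        }
        rc = llapi_file_get_stripe(argv[1], lum_file);
        if (rc) {
                rc = errno;
                goto cleanup;
        }
        /* stripe_size stripe_count */
        printf("%u %u\n",
               (unsigned)lum_file->lmm_stripe_size,
               (unsigned)lum_file->lmm_stripe_count);
cleanup:
        if (lum_file != NULL)
                free(lum_file);
        return rc;
}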
Synopsis
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <liblustre.h>
#include <lustre/lustre_idl.h>
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
Description
The llapi_file_create() call is equivalent to the llapi_file_open call with
flags equal to O_CREAT|O_WRONLY and mode equal to 0644, followed by file close.
Option Description
flags
Can be a combination of O_RDONLY, O_WRONLY, O_RDWR, O_CREAT, O_EXCL,
O_NOCTTY, O_TRUNC, O_APPEND, O_NONBLOCK, O_SYNC, FASYNC,
O_DIRECT, O_LARGEFILE, O_DIRECTORY, O_NOFOLLOW, O_NOATIME.
mode
Specifies the permission bits to be used for a new file when O_CREAT is used.
stripe_size
Specifies stripe size (in bytes). Should be multiple of 64 KB, not exceeding 4 GB.
stripe_offset
Specifies an OST index from which the file should start. The default value is -1.
stripe_count
Specifies the number of OSTs to stripe the file across. The default value is -1.
stripe_pattern
Specifies the striping pattern. In this version of Lustre, only LOV_PATTERN_RAID0
is available. The default value is 0.
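The stripe_size constraint above (a multiple of 64 KB, not exceeding 4 GB) can be checked before calling llapi_file_open. The helper below is a hypothetical sketch, not part of the llapi library. It also treats 0 as valid, on the assumption stated elsewhere in this chapter that stripe_size = 0 selects the system default.

```c
#include <stdint.h>

#define STRIPE_SIZE_UNIT (64ULL * 1024)               /* 64 KB */
#define STRIPE_SIZE_MAX  (4ULL * 1024 * 1024 * 1024)  /* 4 GB */

/* Returns 1 if stripe_size is acceptable under the rule above, else 0. */
int stripe_size_is_valid(uint64_t stripe_size)
{
    if (stripe_size == 0)        /* 0 requests the file system default */
        return 1;
    return stripe_size % STRIPE_SIZE_UNIT == 0 &&
           stripe_size <= STRIPE_SIZE_MAX;
}
```

A caller could reject a bad value up front instead of waiting for the library call to fail with EINVAL.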
Return Values
llapi_file_open() and llapi_file_create() return:
Errors
Errors Description
Synopsis
#include <liblustre.h>
#include <lustre/lustre_idl.h>
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
int llapi_quotactl(char *mnt, struct if_quotactl *qctl);
struct if_quotactl {
__u32 qc_cmd;
__u32 qc_type;
__u32 qc_id;
__u32 qc_stat;
struct obd_dqinfo qc_dqinfo;
struct obd_dqblk qc_dqblk;
char obd_type[16];
struct obd_uuid obd_uuid;
};
struct obd_dqblk {
__u64 dqb_bhardlimit;
__u64 dqb_bsoftlimit;
__u64 dqb_curspace;
__u64 dqb_ihardlimit;
__u64 dqb_isoftlimit;
__u64 dqb_curinodes;
__u64 dqb_btime;
__u64 dqb_itime;
__u32 dqb_valid;
__u32 padding;
};
struct obd_dqinfo {
__u64 dqi_bgrace;
__u64 dqi_igrace;
__u32 dqi_flags;
__u32 dqi_valid;
};
Description
The llapi_quotactl() command manipulates disk quotas on a Lustre file system
mount. qc_cmd indicates a command to be applied to UID qc_id or GID qc_id.
Option Description
LUSTRE_Q_QUOTAON
Turns on quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or
UGQUOTA (both user and group quota). The quota files must exist. They are
normally created with the llapi_quotacheck call. This call is restricted to the
super user privilege.
LUSTRE_Q_QUOTAOFF
Turns off quotas for a Lustre file system. qc_type is USRQUOTA, GRPQUOTA or
UGQUOTA (both user and group quota). This call is restricted to the super user
privilege.
LUSTRE_Q_GETQUOTA
Gets disk quota limits and current usage for user or group qc_id. qc_type is
USRQUOTA or GRPQUOTA. uuid may be filled with OBD UUID string to query
quota information from a specific node. dqb_valid may be set nonzero to query
information only from MDS. If uuid is an empty string and dqb_valid is zero then
cluster-wide limits and usage are returned. On return, obd_dqblk contains the
requested information (block limits unit is kilobyte). Quotas must be turned on
before using this command.
LUSTRE_Q_SETQUOTA
Sets disk quota limits for user or group qc_id. qc_type is USRQUOTA or
GRPQUOTA. dqb_valid must be set to QIF_ILIMITS, QIF_BLIMITS or QIF_LIMITS
(both inode limits and block limits) dependent on updating limits. obd_dqblk must
be filled with limits values (as set in dqb_valid, block limits unit is kilobyte). Quotas
must be turned on before using this command.
LUSTRE_Q_GETINFO
Gets information about quotas. qc_type is either USRQUOTA or GRPQUOTA. On
return, dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace time (in
seconds), dqi_flags is not used by the current Lustre version.
LUSTRE_Q_SETINFO
Sets quota information (like grace times). qc_type is either USRQUOTA or
GRPQUOTA. dqi_igrace is inode grace time (in seconds), dqi_bgrace is block grace
time (in seconds), dqi_flags is not used by the current Lustre version and must be
zeroed.
Return Values
llapi_quotactl() returns:
0 On success
Errors
llapi_quotactl errors are described below.
Errors Description
Synopsis
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
Description
The llapi_path2fid function returns the FID (sequence : object ID : version) for
the pathname.
Return Values
llapi_path2fid returns:
0 On success
You can set striping from within a program, much as you would with an ioctl call.
To compile the sample program, you need to download the libtest.c and
liblustreapi.c files from the Lustre source tree.
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <stdlib.h>
#include <lustre/liblustreapi.h>
#include <lustre/lustre_user.h>
#define MAX_OSTS 1024
#define LOV_EA_SIZE(lum, num) (sizeof(*lum) + num * sizeof(*lum->lmm_objects))
#define LOV_EA_MAX(lum) LOV_EA_SIZE(lum, MAX_OSTS)
/*
This program provides crude examples of using the liblustre API functions
*/
int open_stripe_file()
{
        char *tfile = TESTFILE;
        int stripe_size = 65536;            /* System default is 4M */
        int stripe_offset = -1;             /* Start at default */
        int stripe_count = MY_STRIPE_WIDTH; /* Single stripe for this demo */
        int stripe_pattern = 0;             /* only RAID 0 at this time */
        int rc, fd;

        rc = llapi_file_create(tfile, stripe_size, stripe_offset,
                               stripe_count, stripe_pattern);
        /* result code is inverted, we may return -EINVAL or an ioctl error.
           We borrow an error message from sanity.c
        */
        if (rc) {
                fprintf(stderr, "llapi_file_create failed: %d (%s) \n", rc,
                        strerror(-rc));
                return -1;
        }
        /* llapi_file_create closes the file descriptor, we must re-open */
        fd = open(tfile, O_CREAT | O_RDWR | O_LOV_DELAY_CREATE, 0644);
        if (fd < 0) {
                fprintf(stderr, "Can't open %s file: %d (%s)\n", tfile, errno,
                        strerror(errno));
                return -1;
        }
        return fd;
}
lump = malloc(LOV_EA_MAX(lump));
if (lump == NULL) {
return -1;
}
rc = llapi_file_get_stripe(path, lump);
if (rc != 0) {
fprintf(stderr, "get_stripe failed: %d (%s)\n",errno,
strerror(errno));
return -1;
}
}
/* Ping all OSTs that belong to this filesystem */
int ping_osts()
{
        DIR *dir;
        struct dirent *d;
        char osc_dir[100];
        int rc;

        sprintf(osc_dir, "/proc/fs/lustre/osc");
        dir = opendir(osc_dir);
        if (dir == NULL) {
                printf("Can't open dir\n");
                return -1;
        }
        while ((d = readdir(dir)) != NULL) {
                if (d->d_type == DT_DIR) {
                        if (!strncmp(d->d_name, "OSC", 3)) {
                                printf("Pinging OSC %s ", d->d_name);
                                rc = llapi_ping("osc", d->d_name);
                                if (rc) {
                                        printf(" bad\n");
                                } else {
                                        printf(" good\n");
                                }
                        }
                }
        }
        closedir(dir);  /* release the directory handle */
        return 0;
}
int main()
{
int file;
int rc;
char filename[100];
char sys_cmd[100];
printf("All done\n");
exit(rc);
}
See Also
llapi_file_create in Section 34.1, “llapi_file_create” on page 34-2
This section describes configuration files and module parameters and includes the
following sections:
■ Introduction
■ Module Options
35.1 Introduction
LNET network hardware and routing are now configured via module parameters.
Parameters should be specified in the /etc/modprobe.conf file, for example:
The above option specifies that this node should use all the available TCP and Elan
interfaces.
Module parameters are read when the module is first loaded. Type-specific LND
modules (for instance, ksocklnd) are loaded automatically by the LNET module
when LNET starts (typically upon modprobe ptlrpc).
Under Linux 2.4, sysfs is not available, but the LND-specific parameters are
accessible via equivalent paths under /proc.
Important: All old (pre v.1.4.6) Lustre configuration lines should be removed from
the module configuration files and replaced with the following. Make sure that
CONFIG_KMOD is set in your linux.config so LNET can load the following modules
it needs. The basic module files are:
For the following parameters, default option settings are shown in parentheses.
Changes to parameters marked with a W affect running systems. (Unmarked
parameters can only be set when LNET loads for the first time.) Changes to
parameters marked with Wc only have effect when connections are established
(existing connections are not affected by these changes.)
# lctl
# lctl> net down
■ Remember the lctl ping {nid} command - it is a handy way to check your
LNET configuration.
ip2nets ("") is a string that lists globally-available networks, each with a set of IP
address ranges. LNET determines the locally-available networks from this list by
matching the IP address ranges with the local IPs of a node. The purpose of this
option is to be able to use the same modules.conf file across a variety of nodes on
different networks. The string has the following syntax.
<net-spec> contains enough information to uniquely identify the network and load
an appropriate LND. The LND determines the missing "address-within-network"
part of the NID based on the interfaces it can use.
<iface-list> specifies which hardware interface the network can use. If omitted, all
interfaces are used. LNDs that do not support the <iface-list> syntax cannot be
configured to use particular interfaces and just use what is there. Only a single
instance of these LNDs can exist on a node at any time, and <iface-list> must be
omitted.
<net-match> entries are scanned in the order declared to see if one of the node's IP
addresses matches one of the <ip-range> expressions. If there is a match, <net-spec>
specifies the network to instantiate. Note that it is the first match for a particular
network that counts. This can be used to simplify the match expression for the
general case by placing it after the special cases. For example:
4 nodes on the 134.32.1.* network have 2 interfaces (134.32.1.{4,6,8,10}) but all the rest
have 1.
Note that match-all expressions (for instance, *.*.*.*) effectively mask all other
<net-match> entries specified after them. They should be used with caution.
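To illustrate why a match-all expression masks later entries, the following sketch tests a node's IPv4 address against a dotted pattern in which any octet may be '*'. It covers only that wildcard form. The function name is hypothetical, and LNET's own ip2nets parser accepts richer expressions than this sketch handles.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Match a dotted IPv4 pattern such as "134.32.1.*" against an address.
 * Returns 1 on match, 0 on no match, -1 if the pattern is malformed.
 */
int ip_matches(const char *pattern, const unsigned char ip[4])
{
    const char *p = pattern;
    int matched = 1;
    int i;

    for (i = 0; i < 4; i++) {
        if (*p == '*') {
            p++;                          /* wildcard octet matches anything */
        } else if (isdigit((unsigned char)*p)) {
            char *end;
            long v = strtol(p, &end, 10);

            if (v > 255)
                return -1;
            if (v != ip[i])
                matched = 0;              /* keep parsing to validate the rest */
            p = end;
        } else {
            return -1;
        }
        if (i < 3) {
            if (*p != '.')
                return -1;
            p++;
        }
    }
    return *p == '\0' ? matched : -1;
}
```

Because "*.*.*.*" matches every address, any entry for the same network listed after it is never the first match.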
So a node on the network tcp1 that needs to go through a router to get to the Elan
network:
The hopcount is used to help choose the best path between multiply-routed
configurations.
The expansion is a list enclosed in square brackets. Numeric items in the list may be
a single number, a contiguous range of numbers, or a strided range of numbers. For
example, routes="elan 192.168.1.[22-24]@tcp" says that network elan0 is adjacent
(hopcount defaults to 1); and is accessible via 3 routers on the tcp0 network
(192.168.1.22@tcp, 192.168.1.23@tcp and 192.168.1.24@tcp).
routes="[tcp,vib] 2 [8-14/2]@elan" says that 2 networks (tcp0 and vib0) are accessible
through 4 routers (8@elan, 10@elan, 12@elan and 14@elan). The hopcount of 2 means
that traffic to both these networks will traverse 2 routers: first one of the routers
specified in this entry, then one more.
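The numeric items inside the square brackets (a single number, a contiguous range, or a strided range) can be expanded as in the sketch below. The function name is hypothetical and the code handles one item at a time. It illustrates the notation and is not LNET's own parser.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Expand one numeric item: "7", "22-24" or "8-14/2".
 * Writes the values to out and returns how many were written,
 * or -1 if the item is malformed or does not fit in max_out entries.
 */
int expand_range(const char *item, int *out, int max_out)
{
    char *end;
    long lo, hi, stride = 1, v;
    int n = 0;

    if (!isdigit((unsigned char)*item))
        return -1;
    lo = strtol(item, &end, 10);
    hi = lo;                               /* a single number is a range of one */
    if (*end == '-') {
        if (!isdigit((unsigned char)end[1]))
            return -1;
        hi = strtol(end + 1, &end, 10);
        if (*end == '/') {
            if (!isdigit((unsigned char)end[1]))
                return -1;
            stride = strtol(end + 1, &end, 10);
        }
    }
    if (*end != '\0' || hi < lo || stride <= 0)
        return -1;
    for (v = lo; v <= hi; v += stride) {
        if (n >= max_out)
            return -1;
        out[n++] = (int)v;
    }
    return n;
}
```

Expanding "8-14/2" gives 8, 10, 12 and 14, which correspond to the four routers 8@elan through 14@elan in the example above.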
Duplicate entries, entries that route to a local network, and entries that specify
routers on a non-local network are ignored.
Equivalent entries are resolved in favor of the route with the shorter hopcount. The
hopcount, if omitted, defaults to 1 (the remote network is adjacent).
It is an error to specify routes to the same destination with routers on different local
networks.
If the target network string contains no expansions, then the hopcount defaults to 1
and may be omitted (that is, the remote network is adjacent). In practice, this is true
for most multi-network configurations. It is an error to specify an inconsistent hop
count for a given target network. This is why an explicit hopcount is required if the
target network string specifies more than one network.
Variable Description
acceptor
The acceptor is a TCP/IP service that some LNDs use to establish
communications. If a local network requires it and it has not been
disabled, the acceptor listens on a single port for connection
requests that it redirects to the appropriate local network. The
acceptor is part of the LNET module and configured by the
following options:
• secure - Accept connections only from reserved TCP ports (<
1023).
• all - Accept connections from any TCP port. NOTE: this is
required for liblustre clients to allow connections on
non-privileged ports.
• none - Do not run the acceptor.
accept_port (988)
Port number on which the acceptor should listen for connection
requests. All nodes in a site configuration that require an acceptor
must use the same port.
accept_backlog (127)
Maximum length that the queue of pending connections may grow
to (see listen(2)).
accept_timeout (5, W)
Maximum time in seconds the acceptor is allowed to block while
communicating with a peer.
accept_proto_version
Version of the acceptor protocol that should be used by outgoing
connection requests. It defaults to the most recent acceptor protocol
version, but it may be set to the previous version to allow the node
to initiate connections with nodes that only understand that
version of the acceptor protocol. The acceptor can, with some
restrictions, handle either version (that is, it can accept connections
from both 'old' and 'new' peers). For the current version of the
acceptor protocol (version 1), the acceptor is compatible with old
peers if it is only required by a single local network.
Variable Description
timeout (50, W)
Time (in seconds) that communications may be stalled before the
LND completes them with failure.
nconnds (4)
Sets the number of connection daemons.
min_reconnectms (1000, W)
Minimum connection retry interval (in milliseconds). After a failed
connection attempt, this is the time that must elapse before the first
retry. As connection attempts fail, this time is doubled on each
successive retry up to a maximum of 'max_reconnectms'.
max_reconnectms (6000, W)
Maximum connection retry interval (in milliseconds).
eager_ack (0 on linux, 1 on darwin, W)
Boolean that determines whether the socklnd should attempt to
flush sends on message boundaries.
typed_conns (1, Wc)
Boolean that determines whether the socklnd should use different
sockets for different types of messages. When clear, all
communication with a particular peer takes place on the same
socket. Otherwise, separate sockets are used for bulk sends, bulk
receives and everything else.
min_bulk (1024, W)
Determines when a message is considered "bulk".
tx_buffer_size, rx_buffer_size (8388608, Wc)
Socket buffer sizes. Setting this option to zero (0) allows the
system to auto-tune buffer sizes. WARNING: Be very careful
changing this value as improper sizing can harm performance.
nagle (0, Wc)
Boolean that determines if nagle should be enabled. It should never
be set in production systems.
keepalive_idle (30, Wc)
Time (in seconds) that a socket can remain idle before a keepalive
probe is sent. Setting this value to zero (0) disables keepalives.
keepalive_intvl (2, Wc)
Time (in seconds) to repeat unanswered keepalive probes. Setting
this value to zero (0) disables keepalives.
keepalive_count (10, Wc)
Number of unanswered keepalive probes before pronouncing
socket (hence peer) death.
enable_irq_affinity (0, Wc)
Boolean that determines whether to enable IRQ affinity. The
default is zero (0).
When set, socklnd attempts to maximize performance by handling
device interrupts and data movement for particular (hardware)
interfaces on particular CPUs. This option is not available on all
platforms. This option requires an SMP system and produces the
best performance with multiple NICs. Systems with multiple CPUs
and a single NIC may see an increase in performance with this
parameter disabled.
zc_min_frag (2048, W)
Determines the minimum message fragment that should be
considered for zero-copy sends. Increasing it above the platform's
PAGE_SIZE disables all zero copy sends. This option is not
available on all platforms.
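The retry behavior described for min_reconnectms and max_reconnectms (the interval doubles after each failed attempt until it reaches the maximum) can be sketched as follows. The function name and its attempt-counting convention are assumptions made for illustration and are not taken from the socklnd source.

```c
/*
 * Interval to wait before the next retry, in milliseconds.
 * failed_retries is the number of retries that have already failed:
 * 0 gives the wait before the first retry.
 */
unsigned long reconnect_interval_ms(unsigned failed_retries,
                                    unsigned long min_ms,
                                    unsigned long max_ms)
{
    unsigned long interval = min_ms;

    while (failed_retries-- > 0 && interval < max_ms) {
        /* Double the interval, clamping at max_ms without overflowing. */
        interval = (interval > max_ms / 2) ? max_ms : interval * 2;
    }
    return interval < max_ms ? interval : max_ms;
}
```

With the defaults of 1000 and 6000 milliseconds, successive waits under this sketch are 1000, 2000, 4000 and then 6000 milliseconds.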
Message Buffers
When ptllnd starts up, it allocates and posts sufficient message buffers to allow all
expected peers (set by concurrent_peers) to send one unsolicited message. The
first message that a peer actually sends is a
The maximum message size is set by the max_msg_size module parameter (default
value is 512). This parameter sets the bulk transfer breakpoint. Below this breakpoint,
payload data is sent in the message itself. Above this breakpoint, a buffer descriptor
is sent and the receiver gets the actual payload.
The buffer size is set by the rxb_npages module parameter (default value is 1). The
default conservatively avoids allocation problems due to kernel memory
fragmentation. However, increasing this value to 2 is probably not risky.
The ptllnd also keeps an additional rxb_nspare buffers (default value is 8) posted to
account for full buffers being handled.
Assuming a 4K page size with 10000 peers, 1258 buffers can be expected to be posted
at startup, increasing to a maximum of 10008 as peers actually connect. By
doubling rxb_npages and halving max_msg_size, this number can be reduced by a
factor of 4.
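The startup figure of 1258 buffers can be reproduced with a short calculation. The formula below is inferred from the numbers and parameter descriptions above rather than taken from the ptllnd source: each buffer of rxb_npages pages holds a whole number of max_msg_size messages, one message slot is needed per expected peer, and rxb_nspare extra buffers are added. The function name is hypothetical.

```c
/*
 * Estimated number of message buffers posted at ptllnd startup.
 * Returns 0 if the parameters do not allow even one message per buffer.
 */
unsigned long ptllnd_startup_buffers(unsigned long peers,
                                     unsigned long max_msg_size,
                                     unsigned long rxb_npages,
                                     unsigned long page_size,
                                     unsigned long rxb_nspare)
{
    unsigned long msgs_per_buffer;

    if (max_msg_size == 0)
        return 0;
    msgs_per_buffer = (rxb_npages * page_size) / max_msg_size;
    if (msgs_per_buffer == 0)
        return 0;
    /* One slot per peer, rounded up to whole buffers, plus the spares. */
    return (peers + msgs_per_buffer - 1) / msgs_per_buffer + rxb_nspare;
}
```

With the defaults (max_msg_size of 512, rxb_npages of 1, rxb_nspare of 8) and 10000 peers on 4K pages, this gives 1250 + 8 = 1258. Doubling rxb_npages and halving max_msg_size gives 313 + 8 = 321, roughly a fourfold reduction.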
The ptllnd uses a single portal set by the portal module parameter (default value of
9) for both message and bulk buffers. Message buffers are always attached with
PTL_INS_AFTER and match anything sent with "message" matchbits. Bulk buffers
are always attached with PTL_INS_BEFORE and match only specific matchbits for
that particular bulk transfer.
This scheme assumes that the majority of ME / MDs posted are for "message"
buffers, and that the overhead of searching through the preceding "bulk" buffers is
acceptable. Since the number of "bulk" buffers posted at any time is also dependent
on the bulk transfer breakpoint set by max_msg_size, this seems like an issue worth
measuring at scale.
The ptllnd has a pool of so-called "tx descriptors", which it uses not only for outgoing
messages, but also to hold state for bulk transfers requested by incoming messages.
This pool should scale with the total number of peers.
To enable the building of the Portals LND (ptllnd.ko) configure with this option:
./configure --with-portals=<path-to-portals-headers>
Variable Description
Variable Description
Of the described variables, only hosts is required. It must be the absolute path to the
MXLND hosts file.
For example:
n_waitd (1) sets the number of threads that process completed MX requests (sends
and receives).
max_peers (1024) tells MXLND the upper limit of machines that it will need to
communicate with. This affects how many receives it will pre-post and each receive
will use one page of memory. Ideally, on clients, this value will be equal to the total
number of Lustre servers (MDS and OSS). On servers, it needs to equal the total
number of machines in the storage system.
cksum (0) turns on small message checksums. It can be used to aid in
troubleshooting. MX also provides an optional checksumming feature which can
check all messages (large and small). For details, see the MX README.
ntx (256) is the number of total sends in flight from this machine. In actuality,
MXLND reserves half of them for connect messages so make this value twice as large
as you want for the total number of sends in flight.
credits (8) is the number of in-flight messages for a specific peer. This is part of the
flow-control system in Lustre. Increasing this value may improve performance but it
requires more memory because each message requires at least one page.
board (0) is the index of the Myricom NIC. Hosts can have multiple Myricom NICs
and this identifies which one MXLND should use. This value must match the board
value in your MXLND hosts file for this host.
ep_id (3) is the MX endpoint ID. Each process that uses MX is required to have at
least one MX endpoint to access the MX library and NIC. The ID is a simple index
starting at zero (0). This value must match the endpoint ID value in your MXLND
hosts file for this host.
polling (0) determines whether this host will poll or block for MX request
completions. A value of 0 blocks and any positive value will poll that many times
before blocking. Since polling increases CPU usage, we suggest that you set this to
zero (0) on the client and experiment with different values for servers.
This chapter includes system configuration utilities and includes the following
sections:
■ e2scan
■ l_getidentity
■ lctl
■ ll_decode_filter_fid
■ ll_recover_lost_found_objs
■ llobdstat
■ llog_reader
■ llstat
■ llverdev
■ lshowmount
■ lst
■ lustre_rmmod.sh
■ lustre_rsync
■ mkfs.lustre
■ mount.lustre
■ plot-llstat
■ routerstat
■ tunefs.lustre
■ Additional System Configuration Utilities
Note – The system configuration utilities man pages are found in the lustre/doc
folder.
36.1 e2scan
The e2scan utility is an ext2 file system-modified inode scan program. The e2scan
program uses libext2fs to find inodes with ctime or mtime newer than a given time
and prints out their pathname. Use e2scan to efficiently generate lists of files that
have been modified. The e2scan tool is included in the e2fsprogs package, located at:
http://downloads.lustre.org/public/tools/e2fsprogs/
Synopsis
e2scan [options] [-f file] block_device
Description
When invoked, the e2scan utility iterates all inodes on the block device, finds
modified inodes, and prints their inode numbers. A similar iterator, using
libext2fs(5), builds a table (called parent database) which lists the parent node for
each inode. With a lookup function, you can reconstruct modified pathnames from
root.
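The lookup step can be illustrated with a small shell sketch. It does not use libext2fs or the real e2scan data structures; the parent database is a hypothetical three-column table (inode, parent inode, name), and the function name inode_to_path is invented for the example. The only fact relied on is that inode 2 is the root directory on ext2-family file systems.

```shell
#!/bin/sh
# Hypothetical parent database, one "inode parent_inode name" row per line.
parent_db='11 2 home
12 11 alice
13 12 notes.txt'

# Walk from an inode up to the root directory (inode 2), prepending each
# component name along the way. Names containing spaces are not handled.
inode_to_path() {
    ino=$1
    path=""
    while [ "$ino" != "2" ]; do
        row=$(printf '%s\n' "$parent_db" |
              awk -v i="$ino" '$1 == i { print $2, $3; exit }')
        [ -n "$row" ] || return 1      # inode not present in the database
        path="/${row#* }$path"         # prepend this component's name
        ino=${row%% *}                 # continue from the parent inode
    done
    printf '%s\n' "${path:-/}"
}

inode_to_path 13
```

For the sample table, the call prints /home/alice/notes.txt. An inode that is missing from the table makes the function return a non-zero status instead of printing a partial path.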
Options
Option Description
Synopsis
l_getidentity {mdtname} {uid}
Description
The group upcall file contains the path to an executable file that, when properly
installed, is invoked to resolve a numeric UID to a group membership list. This
utility should complete the mds_grp_downcall_data structure and write it to the
/proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_info pseudo-file.
Options
Option Description
mdtname
Metadata server target name
uid
User identifier
Files
The l_getidentity files are located at:
/proc/fs/lustre/mdt/${FSNAME}-MDT{xxxx}/identity_upcall
Synopsis
lctl
lctl --device <devno> <command [args]>
Description
The lctl utility can be invoked in interactive mode by issuing the lctl command. After
that, commands are issued as shown below. The most common lctl commands are:
dl
dk
device
network <up/down>
list_nids
ping nid
help
quit
For a complete list of available commands, type help at the lctl prompt. To get
basic help on command meaning and syntax, type help command. Command
completion is activated with the TAB key, and command history is available via the
up- and down-arrow keys.
For non-interactive use, use the second invocation, which runs the command after
connecting to the device.
When the file system is running, use the lctl set_param command to set
temporary parameters (mapping to items in /proc/{fs,sys}/{lnet,lustre}).
The lctl set_param command uses this syntax:
For example:
Many permanent parameters can be set with lctl conf_param. In general, lctl
conf_param can be used to specify any parameter settable in a /proc/fs/lustre file,
with its own OBD device. The lctl conf_param command uses this syntax:
<obd|fsname>.<obdtype>.<proc_file_name>=<value>
For example:
To get current Lustre parameter settings, use the lctl get_param command with
this syntax:
For example:
To list Lustre parameters that are available to set, use the lctl list_param
command, with this syntax:
For example:
For more information on using lctl to set temporary and permanent parameters,
see Section 13.8.3, “Setting Parameters with lctl” on page 13-9.
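As a sketch of the four parameter commands described above, the following script prints one invocation of each. The parameter names are derived from /proc paths and options that appear elsewhere in this chapter (identity_upcall, sys.timeout, the obdfilter stats file). The file system name testfs, the target names, and the upcall path /usr/sbin/l_getidentity are hypothetical. The run helper is not part of Lustre; it prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Temporary setting: <obdtype>.<obdname>.<proc_file_name>=<value>
run lctl set_param 'mdt.testfs-MDT0000.identity_upcall=/usr/sbin/l_getidentity'

# Permanent setting: <obd|fsname>.<obdtype>.<proc_file_name>=<value>
run lctl conf_param 'testfs-MDT0000.sys.timeout=40'

# Read the current value of a parameter.
run lctl get_param 'obdfilter.testfs-OST0000.stats'

# List the parameters available under a device type (pattern quoted so
# the shell does not expand it).
run lctl list_param 'obdfilter.*'
```

Running with DRY_RUN=0 on a node with Lustre loaded would execute the commands; conf_param is normally issued on the MGS node.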
Option Description
network <up/down>|<tcp/elan/myrinet>
Starts or stops LNET, or selects a network type for other lctl LNET commands.
list_nids
Prints all NIDs on the local node. LNET must be running.
which_nid <nidlist>
From a list of NIDs for a remote node, identifies the NID on which interface
communication will occur.
ping <nid>
Checks LNET connectivity via an LNET ping. This uses the fabric appropriate to
the specified NID.
interface_list
Prints the network interface information for a given network type.
peer_list
Prints the known peers for a given network type.
conn_list
Prints all the connected remote NIDs for a given network type.
active_tx
This command prints active transmits. It is only used for the Elan network type.
route_list
Prints the complete routing table.
Device Selection
Option Description
device <devname>
This selects the specified OBD device. All other commands depend on the device
being set.
device_list
Shows the local Lustre OBDs (also available as dl).
Option Description
-d <device|fsname>.<parameter>
Deletes a parameter setting (use the default value at the next restart). A null value
for <value> also deletes the parameter setting.
activate
Re-activates an import after the deactivate operation. This setting is only effective
until the next restart (see conf_param).
deactivate
Deactivates an import; in particular, no new file stripes are assigned to the
deactivated OSC. Running lctl deactivate on the MDS stops new objects from being
allocated on the OST. Running lctl deactivate on Lustre clients causes them to
return -EIO when accessing objects on the OST instead of waiting for recovery.
abort_recovery
Aborts the recovery process on a re-starting MDT or OST.
Note – Lustre tunables are not always accessible using the procfs interface, as it is
platform-specific. As a solution, lctl {get,set}_param has been introduced as a
platform-independent interface to the Lustre tunables. Avoid direct references to
/proc/{fs,sys}/{lustre,lnet}. For future portability, use lctl {get,set}_param
instead.
Lustre can emulate a virtual block device upon a regular file. This emulation is
needed when you are trying to set up a swap space via the file.
Option Description
Option Description
changelog_register
Registers a new changelog user for a particular device. Changelog entries are not
purged beyond a registered user’s set point (see lfs changelog_clear).
changelog_deregister <id>
Unregisters an existing changelog user. If the user’s "clear" record number is the
minimum for the device, changelog records are purged until the next minimum.
Debug
Option Description
debug_daemon
Starts and stops the debug daemon, and controls the output filename and size.
debug_kernel [file] [raw]
Dumps the kernel debug buffer to stdout or a file.
debug_file <input> [output]
Converts the kernel-dumped debug log from binary to plain text format.
clear
Clears the kernel debug buffer.
mark <text>
Inserts marker text in the kernel debug buffer.
filter <subsystem id/debug mask>
Filters kernel debug messages by subsystem or mask.
show <subsystem id/debug mask>
Shows specific types of messages.
debug_list <subs/types>
Lists all subsystem and debug types.
modules <path>
Provides GDB-friendly module information.
Option Description
--device
Device to be used for the operation (specified by name or number). See device_list.
--ignore_errors | ignore_errors
Ignores errors during script processing.
Examples
lctl
$ lctl
lctl > dl
0 UP mgc MGC192.168.0.20@tcp btbb24e3-7deb-2ffa-eab0-44dffe00f692 5
1 UP ost OSS OSS_uuid 3
2 UP obdfilter testfs-OST0000 testfs-OST0000_UUID 3
lctl > dk /tmp/log
Debug log: 87 lines, 87 kept, 0 dropped.
lctl > quit
See Also
mkfs.lustre in Section 36.14, “mkfs.lustre” on page 36-28
Synopsis
ll_decode_filter_fid object_file [object_file ...]
Description
The ll_decode_filter_fid utility decodes and prints the Lustre OST object ID,
MDT FID, and stripe index for the specified OST object(s), which are stored in the
"trusted.fid" attribute on each OST object. This attribute is accessible to
ll_decode_filter_fid when the OST file system is mounted locally as type ldiskfs
for maintenance.
The "trusted.fid" extended attribute is stored on each OST object when it is first
modified (data written or attributes set), and is not accessed or modified by Lustre
after that time.
The OST object ID (objid) is useful in case of OST directory corruption, though
normally the ll_recover_lost_found_objs(8) utility is able to reconstruct the entire
OST object directory hierarchy. The MDS FID can be useful to determine which MDS
inode an OST object is (or was) used by. The stripe index can be used in conjunction
with other OST objects to reconstruct the layout of a file even if the MDT inode was
lost.
Examples
root@oss1# cd /mnt/ost/lost+found
root@oss1# ll_decode_filter_fid \#12345[4,5,8]
#123454: objid=690670 seq=0 parent=[0x751c5:0xfce6e605:0x0]
#123455: objid=614725 seq=0 parent=[0x18d11:0xebba84eb:0x1]
#123458: objid=533088 seq=0 parent=[0x21417:0x19734d61:0x0]
This shows that the three files in lost+found have decimal object IDs 690670, 614725,
and 533088, respectively. The object sequence number (formerly object group) is 0 for
all current OST objects.
The idx field shows the stripe number of this OST object in the Lustre RAID-0 striped
file.
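When many objects have to be processed, the output can be reduced to the fields of interest with standard text tools. The sketch below works on the sample output shown in the example rather than calling the utility, and the function name extract_fids is hypothetical. It assumes each line has exactly the "name: objid=... seq=... parent=[...]" shape shown above.

```shell
#!/bin/sh
# Sample output copied from the example above.
sample='#123454: objid=690670 seq=0 parent=[0x751c5:0xfce6e605:0x0]
#123455: objid=614725 seq=0 parent=[0x18d11:0xebba84eb:0x1]
#123458: objid=533088 seq=0 parent=[0x21417:0x19734d61:0x0]'

# Print "<file> <objid> <parent FID>" for every matching line on stdin.
extract_fids() {
    sed -n 's/^\(.*\): objid=\([0-9][0-9]*\) .*parent=\(\[[^]]*\]\).*$/\1 \2 \3/p'
}

printf '%s\n' "$sample" | extract_fids
```

On a real OST the input would come from ll_decode_filter_fid itself, for example by piping its output into extract_fids.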
See Also
ll_recover_lost_found_objs in Section 36.5, “ll_recover_lost_found_objs” on
page 36-12
36.5 ll_recover_lost_found_objs
The ll_recover_lost_found_objs utility helps recover Lustre OST objects (file
data) from a lost and found directory and return them to their correct locations.
Synopsis
$ ll_recover_lost_found_objs [-hv] -d directory
Description
The first time Lustre writes to an object, it saves the MDS inode number and the objid
as an extended attribute on the object, so in case of directory corruption of the OST, it
is possible to recover the objects. Running e2fsck fixes the corrupted OST directory,
but it puts all of the objects into a lost and found directory, where they are
inaccessible to Lustre. Use the ll_recover_lost_found_objs utility to recover all
(or at least most) objects from a lost and found directory and return them to the
O/0/d* directories.
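To make the O/0/d* destination concrete, the sketch below computes where an object would be expected to live. It assumes the usual ldiskfs OST layout of 32 subdirectories (d0 through d31) selected by object ID modulo 32, under sequence 0. That layout is not stated in this section, so treat it as an assumption; the function name objid_to_path is hypothetical.

```shell
#!/bin/sh
# Map an object ID to its expected location under the OST object tree.
# Assumes sequence 0 and 32 hash subdirectories (d0 .. d31).
objid_to_path() {
    echo "O/0/d$(( $1 % 32 ))/$1"
}

# Object IDs taken from the ll_decode_filter_fid example in this chapter.
objid_to_path 690670
objid_to_path 614725
objid_to_path 533088
```

Under this assumption the three objects map to O/0/d14/690670, O/0/d5/614725 and O/0/d0/533088.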
Options
Option Description
Example
ll_recover_lost_found_objs -d /mnt/ost/lost+found
Synopsis
llobdstat ost_name [interval]
Description
The llobdstat utility displays a line of OST statistics for the given ost_name every
interval seconds. It should be run directly on an OSS node. Type CTRL-C to stop
statistics printing.
Example
# llobdstat liane-OST0002 1
/usr/bin/llobdstat on /proc/fs/lustre/obdfilter/liane-OST0002/stats
Processor counters run at 2800.189 MHz
Read: 1.21431e+07, Write: 9.93363e+08, create/destroy: 24/1499, stat:
34, punch: 18
[NOTE: cx: create, dx: destroy, st: statfs, pu: punch ]
Timestamp Read-delta ReadRate Write-delta WriteRate
--------------------------------------------------------
1217026053 0.00MB 0.00MB/s 0.00MB 0.00MB/s
1217026054 0.00MB 0.00MB/s 0.00MB 0.00MB/s
1217026055 0.00MB 0.00MB/s 0.00MB 0.00MB/s
1217026056 0.00MB 0.00MB/s 0.00MB 0.00MB/s
1217026057 0.00MB 0.00MB/s 0.00MB 0.00MB/s
1217026058 0.00MB 0.00MB/s 0.00MB 0.00MB/s
1217026059 0.00MB 0.00MB/s 0.00MB 0.00MB/s st:1
Files
/proc/fs/lustre/obdfilter/<ostname>/stats
Synopsis
llog_reader filename
Description
The llog_reader utility parses the binary format of Lustre's on-disk configuration
logs. It can only read logs; use tunefs.lustre to write to them.
To examine a log file on a stopped Lustre server, mount its backing file system as
ldiskfs, then use llog_reader to dump the log file's contents, for example:
To examine the same log file on a running Lustre server, use the ldiskfs-enabled
debugfs utility (called debug.ldiskfs on some distributions) to extract the file, for
example:
Caution – Although they are stored in the CONFIGS directory, mountdata files do
not use the configuration log format and will confuse the llog_reader utility.
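The two procedures described above (stopped server and running server) might look like the following. The device /dev/sda, the mount point /mnt/mgs, and the log name CONFIGS/testfs-client are hypothetical. The run helper is not a Lustre tool; it prints each command unless DRY_RUN=0 is set, and its printed form does not preserve shell quoting.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

DEV=/dev/sda                 # hypothetical backing device
LOG=CONFIGS/testfs-client    # hypothetical configuration log

# Stopped server: mount the backing file system as ldiskfs and read the
# log in place.
run mount -t ldiskfs "$DEV" /mnt/mgs
run llog_reader "/mnt/mgs/$LOG"

# Running server: copy the log out with debugfs (-c opens the device
# read-only in catastrophic mode, -R runs a single request), then read
# the copy.
run debugfs -c -R "dump $LOG /tmp/testfs-client" "$DEV"
run llog_reader /tmp/testfs-client
```

Copying the file out with debugfs avoids mounting a device that is already in use by the running server.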
See Also
tunefs.lustre in Section 36.18, “tunefs.lustre” on page 36-38
Synopsis
llstat [-c] [-g] [-i interval] stats_file
Description
The llstat utility can display statistics from any of several Lustre statistics files that
share a common format, updated every interval seconds. Use CTRL-C to stop
statistics printing.
Option Description
-c
Clears the statistics file first.
-i interval
Polling period (in seconds).
-g
Graphable output format.
-h
Displays help information.
stats_file
Either the full path to a statistics file or the shorthand: MDS or OST.
Example
Monitors statistics on /proc/fs/lustre/ost/OSS/ost/stats at one (1) second intervals:
llstat -i 1 ost
36.9 llverdev
The llverdev utility verifies that a block device is functioning properly over its full size.
Synopsis
llverdev [-c chunksize] [-f] [-h] [-o offset] [-l] [-p] [-r]
[-t timestamp] [-v] [-w] device
Description
Sometimes kernel drivers or hardware devices have bugs that prevent them from
accessing the full device size correctly, or possibly have bad sectors on disk or other
problems which prevent proper data storage. There are often defects associated with
major system boundaries such as 2^32 bytes, 2^31 sectors, 2^31 blocks, 2^32 blocks,
etc.
The llverdev utility writes and verifies a unique test pattern across the entire
device to ensure that data is accessible after it was written, and that data written to
one part of the disk is not overwriting data on another part of the disk.
It is expected that llverdev will be run on large size devices (TB). It is always better
to run llverdev in verbose mode, so that device testing can be easily restarted from
the point where it was stopped.
Option Description
-c|--chunksize
I/O chunk size in bytes (default value is 1048576).
-f|--force
Forces the test to run without a confirmation that the device will be overwritten and
all data will be permanently destroyed.
-h|--help
Displays a brief help message.
-o offset
Offset (in kilobytes) of the start of the test (default value is 0).
-l|--long
Runs a full check, writing and then reading and verifying every block on the disk.
-p|--partial
Runs a partial check, only doing periodic checks across the device (1 GB steps).
-r|--read
Runs the test in read (verify) mode only, after having previously run the test in -w
mode.
-t timestamp
Sets the test start time as printed at the start of a previously-interrupted test to
ensure that validation data is the same across the entire filesystem (default value is
the current time()).
-v|--verbose
Runs the test in verbose mode, listing each read and write operation.
-w|--write
Runs the test in write (test-pattern) mode (default runs both read and write).
Examples
Runs a partial device verification on /dev/sda:
llverdev -v -p /dev/sda
llverdev: permanently overwrite all data on /dev/sda (yes/no)? y
llverdev: /dev/sda is 4398046511104 bytes (4096.0 GB) in size
Timestamp: 1009839028
Current write offset: 4096 kB
Continues an interrupted verification at offset 4096kB from the start of the device,
using the same timestamp as the previous run:
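One way to build such a command is to read the timestamp and the last reported offset back from the saved verbose output of the interrupted run. The sketch below does this for the sample output shown above, using only the -o and -t options from the option table; it prints the resulting command rather than running it, because llverdev overwrites the device. Saving the verbose output to a log is an assumption about how the first run was invoked.

```shell
#!/bin/sh
# Saved verbose output of the interrupted run (copied from the example above).
log='llverdev: /dev/sda is 4398046511104 bytes (4096.0 GB) in size
Timestamp: 1009839028
Current write offset: 4096 kB'

# Timestamp printed at the start of the interrupted run.
timestamp=$(printf '%s\n' "$log" | awk '/^Timestamp:/ { print $2; exit }')

# Last offset reported before the interruption (in kB).
offset_kb=$(printf '%s\n' "$log" |
            awk '/^Current write offset:/ { o = $4 } END { print o }')

resume_cmd="llverdev -v -p -o $offset_kb -t $timestamp /dev/sda"
echo "$resume_cmd"
```

For the sample output this prints llverdev -v -p -o 4096 -t 1009839028 /dev/sda.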
Synopsis
lshowmount [-ehlv]
Description
The lshowmount utility shows the hosts that have Lustre mounted to a server. This
utility looks for exports from the MGS, MDS, and obdfilter.
Options
Option Description
-e|--enumerate
Causes lshowmount to list each client mounted on a separate line instead of trying
to compress the list of clients into a hostrange string.
-h|--help
Causes lshowmount to print out a usage message.
-l|--lookup
Causes lshowmount to try to look up the hostname for NIDs that look like IP
addresses.
-v|--verbose
Causes lshowmount to output export information for each service instead of only
displaying the aggregate information for all Lustre services on the server.
Files
/proc/fs/lustre/mgs/<server>/exports/<uuid>/nid
/proc/fs/lustre/mds/<server>/exports/<uuid>/nid
/proc/fs/lustre/obdfilter/<server>/exports/<uuid>/nid
Synopsis
lst
Description
LNET self-test helps site administrators confirm that Lustre Networking (LNET) has
been properly installed and configured. The self-test also confirms that LNET and the
network software and hardware underlying it are performing as expected.
Each LNET self-test runs in the context of a session. A node can be associated with
only one session at a time, to ensure that the session has exclusive use of the nodes
on which it is running. A session is created, controlled and monitored from a single
node; this is referred to as the self-test console.
Any node may act as the self-test console. Nodes are named and allocated to a
self-test session in groups. This allows all nodes in a group to be referenced by a
single name.
Test configurations are built by describing and running test batches. A test batch is a
named collection of tests, with each test composed of a number of individual
point-to-point tests running in parallel. These individual point-to-point tests are
instantiated according to the test type, source group, target group and distribution
specified when the test is added to the test batch.
Modules
To run LNET self-test, load these modules: libcfs, lnet, lnet_selftest and any one of
the klnds (ksocklnd, ko2iblnd...). To load all necessary modules, run modprobe
lnet_selftest, which recursively loads the modules on which lnet_selftest depends.
There are two types of nodes for LNET self-test: the console node and test nodes.
Both node types require all previously-specified modules to be loaded. (The
userspace test node does not require these modules).
Utilities
LNET self-test includes two user utilities, lst and lstclient.
lst is the user interface for the self-test console (run on the console node). It
provides a list of commands to control the entire test system, such as create session,
create test groups, etc.
lstclient is the userspace self-test program which is linked with userspace LNDs
and LNET. A user can invoke lstclient to join a self-test session:
Example Script
This is a sample LNET self-test script which simulates the traffic pattern of a set of
Lustre servers on a TCP network, accessed by Lustre clients on an IB network
(connected via LNET routers), with half the clients reading and half the clients
writing.
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 192.168.10.[8,10,12-16]@tcp
lst add_group readers 192.168.1.[1-253/2]@o2ib
lst add_group writers 192.168.1.[2-254/2]@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers brw read \
check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers brw write \
check=full size=4K
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session
Note – The lustre_rmmod.sh utility does not work if Lustre modules are being
used or if you have manually run the lctl network up command.
36.13 lustre_rsync
The lustre_rsync utility synchronizes (replicates) a Lustre file system to a target
file system.
Synopsis
lustre_rsync --source|-s <src> --target|-t <tgt>
--mdt|-m <mdt> [--user|-u <user id>]
[--xattr|-x <yes|no>] [--verbose|-v]
[--statuslog|-l <log>] [--dry-run] [--abort-on-err]
- AND -
■ Verify that the Lustre file system (source) and the replica file system (target) are
identical before the changelog user is registered. If the file systems are discrepant,
use a utility, e.g. regular rsync (not lustre_rsync) to make them identical.
Option Description
--source=<src>
The path to the root of the Lustre file system (source) which will be
synchronized. This is a mandatory option if a valid status log created
during a previous synchronization operation (--statuslog) is not
specified.
--target=<tgt>
The path to the root where the source file system will be synchronized
(target). This is a mandatory option if the status log created during a
previous synchronization operation (--statuslog) is not specified.
This option can be repeated if multiple synchronization targets are
desired.
--mdt=<mdt>
The metadata device to be synchronized. A changelog user must be
registered for this device. This is a mandatory option if a valid status
log created during a previous synchronization operation
(--statuslog) is not specified.
--user=<user id>
The changelog user ID for the specified MDT. To use lustre_rsync,
the changelog user must be registered. For details, see the
changelog_register parameter in the lctl man page. This is a
mandatory option if a valid status log created during a previous
synchronization operation (--statuslog) is not specified.
--statuslog=<log>
A log file to which synchronization status is saved. When
lustre_rsync starts, the state of a previous replication is read from
here. If the status log from a previous synchronization operation is
specified, otherwise mandatory options such as --source, --target and
--mdt may be omitted. Specifying --source, --target and/or --mdt in
addition to --statuslog overrides the corresponding parameters in the
status log; command line options take precedence over options in the
status log.
--xattr <yes|no>
Specifies whether extended attributes (xattrs) are synchronized or not.
The default is to synchronize extended attributes.
NOTE: Disabling xattrs causes Lustre striping information not to be
synchronized.
--verbose
Produces verbose output.
--dry-run
Shows the output of lustre_rsync commands (copy, mkdir, etc.) on
the target file system without actually executing them.
--abort-on-err
Stops processing the lustre_rsync operation if an error occurs. The
default is to continue the operation.
Examples
Register a changelog user for an MDT (e.g., MDT lustre-MDT0000).
$ ssh $MDS lctl changelog_register \
--device lustre-MDT0000 -n
cl1
After the file system undergoes changes, synchronize the changes with the target file
system. Only the statuslog name needs to be specified, as it has all the parameters
passed earlier.
$ lustre_rsync --source=/mnt/lustre \
--target=/mnt/target1 --target=/mnt/target2 \
--mdt=lustre-MDT0000 --user=cl1 \
--statuslog replicate.log
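The difference between an initial replication and a follow-up run that relies on the status log can be sketched as follows. The paths, MDT name, and changelog user are the ones used in the examples above. The run helper is not part of Lustre; it prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Initial replication: every mandatory option is given and the state is
# saved to replicate.log.
run lustre_rsync --source=/mnt/lustre --target=/mnt/target1 \
    --mdt=lustre-MDT0000 --user=cl1 --statuslog replicate.log

# Follow-up run: the saved status log supplies the source, target, MDT
# and changelog user, so only --statuslog is required.
run lustre_rsync --statuslog replicate.log --verbose
```

Any of the mandatory options can still be added to the second command to override the value recorded in the status log.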
See Also
lctl in Section 36.3, “lctl” on page 36-4
Synopsis
mkfs.lustre <target_type> [options] device
Option Description
--ost
Object Storage Target (OST)
--mdt
Metadata Storage Target (MDT)
--network=net,...
Network(s) to which to restrict this OST/MDT. This option can be repeated as
necessary.
--mgs
Configuration Management Service (MGS), one per site. This service can be
combined with one --mdt service by specifying both types.
When the file system is created, parameters can simply be added as a --param
option to the mkfs.lustre command. See Section 13.8.1, “Setting Parameters with
mkfs.lustre” on page 13-9.
Option Description
--backfstype=fstype
Forces a particular format for the backing file system (ldiskfs).
--comment=comment
Sets a user comment about this disk, ignored by Lustre.
--device-size=KB
Sets the device size for loop devices.
--dryrun
Only prints what would be done; it does not affect the disk.
--failnode=nid,...
Sets the NID(s) of a failover partner. This option can be repeated as
needed.
--fsname=filesystem_name
The Lustre file system of which this service/node will be a part. The
default file system name is “lustre”.
--mountfsoptions=opts
Sets the mount options used when the backing file system is
mounted.
CAUTION: Unlike earlier versions of mkfs.lustre, this version
completely replaces the default mount options with those specified
on the command line, and issues a warning on stderr if any default
mount options are omitted.
The defaults for ldiskfs are:
OST: errors=remount-ro,mballoc,extents;
MGS/MDT: errors=remount-ro,iopen_nopriv,user_xattr
Do not alter the default mount options unless you know what you are
doing.
--network=net,...
Network(s) to which to restrict this OST/MDT. This option can be
repeated as necessary.
--mgsnode=nid,...
Sets the NIDs of the MGS node, required for all targets other than the
MGS.
--param key=value
Sets the permanent parameter key to value value. This option can be
repeated as necessary. Typical options might include:
--param sys.timeout=40
System obd timeout.
--param lov.stripesize=2M
Default stripe size.
--param lov.stripecount=2
Default stripe count.
--param failover.mode=failout
Returns errors instead of waiting for recovery.
--quiet
Prints less information.
--reformat
Reformats an existing Lustre disk.
--stripe_count_hint=stripes
Examples
Creates a combined MGS and MDT for file system testfs on, e.g., node cfs21:
Creates an OST for file system testfs on any node (using the above MGS):
Creates an MDT for file system myfs1 on any node (using the above MGS):
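Commands matching the three descriptions above might look like the following sketch. It uses only options from the tables in this section. The device names /dev/sda1, /dev/sdb and /dev/sda2 are hypothetical, and the MGS NID cfs21@tcp0 is assumed from the node name cfs21. Because mkfs.lustre formats the device, the run helper (not part of Lustre) only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

MGSNID=cfs21@tcp0   # assumed NID of the node hosting the MGS

# Combined MGS and MDT for file system testfs (run on cfs21).
run mkfs.lustre --fsname=testfs --mdt --mgs /dev/sda1

# OST for testfs on any node, pointing at the MGS above.
run mkfs.lustre --fsname=testfs --ost --mgsnode="$MGSNID" /dev/sdb

# MDT for a second file system, myfs1, served by the same MGS.
run mkfs.lustre --fsname=myfs1 --mdt --mgsnode="$MGSNID" /dev/sda2
```

Adding --dryrun to any of these commands is a further safeguard, since it makes mkfs.lustre itself print what would be done without touching the disk.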
See Also
llobdstat in Section 36.6, “llobdstat” on page 36-14
Synopsis
mount -t lustre [-o options] device directory
Description
The mount.lustre utility starts a Lustre client or target service. This program
should not be called directly; rather, it is a helper program invoked through
mount(8), as shown above. Use the umount(8) command to stop Lustre clients and
targets.
There are two forms for the device option, depending on whether a client or a target
service is started:
Option Description
<mgsspec>:/<fsname>
Mounts the Lustre file system named fsname on the client by contacting the
Management Service at mgsspec on the pathname given by directory. The format for
mgsspec is defined below. A mounted client file system appears in fstab(5) and is
usable, like any local file system, and provides a full POSIX-compliant interface.
<disk_device>
Starts the target service defined by the mkfs.lustre command on the physical
disk disk_device. A mounted target service file system is only useful for df(1)
operations and appears in fstab(5) to show the device is in use.
Option Description
<mgsspec>:=<mgsnode>[:<mgsnode>]
The MGS specification may be a colon-separated list of nodes.
<mgsnode>:=<mgsnid>[,<mgsnid>]
Each node may be specified by a comma-separated list of NIDs.
Option Description
flock
Enables flock support, coherent across all client nodes.
localflock
Enables local flock support, using only client-local flock (faster, for applications
that require flock, but do not run on multiple nodes).
noflock
Disables flock support entirely. Applications calling flock get an error. It is up to
the administrator to choose either localflock (fastest, low impact, not coherent
between nodes) or flock (slower, performance impact for use, coherent between
nodes).
user_xattr
Enables get/set of extended attributes by regular users. See the attr(5) manual
page.
nouser_xattr
Disables use of extended attributes by regular users. Root and system processes
can still use extended attributes.
acl
Enables POSIX Access Control List support. See the acl(5) manual page.
noacl
Disables Access Control List support.
Option Description
nosvc
Starts only the MGC (and MGS, if co-located) for a target service, not the actual
service.
nomgs
Starts the MDT with a co-located MGS, without starting the MGS.
exclude=ostlist
Starts a client or MDT with a (colon-separated) list of known inactive OSTs.
abort_recov
Aborts client recovery and starts the target service immediately.
md_stripe_cache_size
Sets the stripe cache size for server-side disk with a striped RAID configuration.
recovery_time_soft=timeout
Allows timeout seconds for clients to reconnect for recovery after a server crash.
This timeout is incrementally extended if it is about to expire and the server is still
handling new connections from recoverable clients. The default soft recovery
timeout is 300 seconds (5 minutes).
recovery_time_hard=timeout
The server is allowed to incrementally extend its timeout, up to a hard maximum
of timeout seconds. The default hard recovery timeout is 900 seconds (15 minutes).
Examples
Starts a client for the Lustre file system testfs at mount point
/mnt/myfilesystem. The Management Service is running on a node reachable
from this client via the cfs21@tcp0 NID.
Starts the Lustre metadata target service from /dev/sda1 on mount point
/mnt/test/mdt.
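The two cases described above correspond to the two forms of the device argument. A minimal sketch, using the names given in the descriptions, is shown below. The run helper is not part of Lustre; it prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Client mount: the device has the form <mgsspec>:/<fsname>.
run mount -t lustre cfs21@tcp0:/testfs /mnt/myfilesystem

# Target mount: the device is the disk formatted by mkfs.lustre.
run mount -t lustre /dev/sda1 /mnt/test/mdt
```

Client options such as flock or user_xattr would be passed with -o in the first command.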
See Also
mkfs.lustre in Section 36.14, “mkfs.lustre” on page 36-28
36.16 plot-llstat
The plot-llstat utility plots Lustre statistics.
Synopsis
plot-llstat results_filename [parameter_index]
Description
The plot-llstat utility generates a CSV file and instruction files for gnuplot from
the output of llstat. Since llstat is generic in nature, plot-llstat is also a
generic script. The value of parameter_index can be 1 for count per interval, 2 for
count per second (default setting) or 3 for total count.
The plot-llstat utility creates a .dat (CSV) file using the number of operations
specified by the user. The number of operations equals the number of columns in the
CSV file. The values in those columns are equal to the corresponding value of
parameter_index in the output file.
The plot-llstat utility also creates a .scr file that contains instructions for gnuplot
to plot the graph. After generating the .dat and .scr files, the plot-llstat tool
invokes gnuplot to display the graph.
Option Description
results_filename
Output generated by llstat
parameter_index
Value of parameter_index can be:
1 - count per interval
2 - count per second (default setting)
3 - total count
Example
llstat -i2 -g -c lustre-OST0000 > log
plot-llstat log 3
Synopsis
routerstat [interval]
Description
The routerstat utility watches LNET router statistics. If no interval is specified,
then statistics are sampled and printed only one time. Otherwise, statistics are
sampled and printed at the specified interval (in seconds).
Options
The routerstat output includes the following fields:
Option Description
M msgs_alloc(msgs_max)
E errors
S send_count/send_length
R recv_count/recv_length
F route_count/route_length
D drop_count/drop_length
Files
The routerstat utility extracts statistics data from:
/proc/sys/lnet/stats
Synopsis
tunefs.lustre [options] device
Description
tunefs.lustre is used to modify configuration information on a Lustre target disk.
This includes upgrading old (pre-Lustre 1.6) disks. This does not reformat the disk or
erase the target information, but modifying the configuration information can result
in an unusable file system.
Caution – Changes made here take effect only the next time the target is
mounted.
Option Description
--comment=comment
Sets a user comment about this disk, ignored by Lustre.
--dryrun
Only prints what would be done; does not affect the disk.
--erase-params
Removes all previous parameter information.
--failnode=nid,...
Sets the NID(s) of a failover partner. This option can be repeated as needed.
--fsname=filesystem_name
The Lustre file system of which this service will be a part. The default file system
name is “lustre”.
--index=index
Forces a particular OST or MDT index.
--mountfsoptions=opts
Sets the mount options used when the backing file system is mounted.
CAUTION: Unlike earlier versions of tunefs.lustre, this version completely
replaces the existing mount options with those specified on the command line, and
issues a warning on stderr if any default mount options are omitted.
The defaults for ldiskfs are:
OST: errors=remount-ro,mballoc,extents;
MGS/MDT: errors=remount-ro,iopen_nopriv,user_xattr
Do not alter the default mount options unless you know what you are doing.
--network=net,...
Network(s) to which to restrict this OST/MDT. This option can be repeated as
necessary.
--mgs
Adds a configuration management service to this target.
--mgsnode=nid,...
Sets the NID(s) of the MGS node; required for all targets other than the MGS.
--nomgs
Removes a configuration management service from this target.
--quiet
Prints less information.
--verbose
Prints more information.
--writeconf
Erases all configuration logs for the file system to which this MDT belongs, and
regenerates them. This is a dangerous operation. All clients must be unmounted and
servers for this file system should be stopped. All targets (OSTs/MDTs) must then
be restarted to regenerate the logs. No clients should be started until all targets
have restarted.
Examples
Change the MGS’s NID address. (This should be done on each target disk, since they
should all contact the same MGS.)
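A minimal sketch of this procedure, using only options described above (with the
MGS node option spelled --mgsnode, the form tunefs.lustre accepts), might look
like the following. The NID and device name are hypothetical placeholders, and the
helper function only prints the command lines so that they can be reviewed before
anything is run.

```shell
#!/bin/bash
# Hypothetical values: substitute the new MGS NID and the target's device.
NEW_MGS_NID="192.168.0.10@tcp0"
TARGET_DEV="/dev/sda"

# Print the tunefs.lustre command line. With "yes", --dryrun is added so
# that tunefs.lustre only reports what it would do and leaves the disk alone.
mgs_nid_cmd() {
    local cmd="tunefs.lustre"
    if [ "$1" = "yes" ]; then
        cmd="$cmd --dryrun"
    fi
    echo "$cmd --erase-params --mgsnode=$NEW_MGS_NID --writeconf $TARGET_DEV"
}

mgs_nid_cmd yes   # preview form
mgs_nid_cmd no    # form to run once the preview looks right
```

Because --erase-params removes all previously stored parameters, any other
parameters the target needs would have to be supplied again on the same command
line. The --writeconf precautions described above (clients unmounted, servers
stopped) apply.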
See Also
mkfs.lustre in Section 36.14, “mkfs.lustre” on page 36-28
lustre_req_history.sh
llstat.sh
The llstat.sh utility (improved in Lustre 1.6) handles a wider range of /proc
files, and has command line switches to produce more graphable output.
plot-llstat.sh
The plot-llstat.sh utility plots the output from llstat.sh using gnuplot.
vfs_ops_stats
The client vfs_ops_stats utility tracks Linux VFS operation calls into Lustre for a
single PID, PPID, GID or everything.
/proc/fs/lustre/llite/*/vfs_ops_stats
/proc/fs/lustre/llite/*/vfs_track_[pid|ppid|gid]
extents_stats
The client extents_stats utility shows the size distribution of I/O calls from the
client (cumulative and by process).
/proc/fs/lustre/llite/*/extents_stats, extents_stats_per_process
offset_stats
The client offset_stats utility shows the read/write seek activity of a client by
offsets and ranges.
/proc/fs/lustre/llite/*/offset_stats
/proc/fs/lustre/mds|obdfilter/*/exports/
■ Improved MDT statistics
More detailed MDT operations statistics are collected for better profiling.
/proc/fs/lustre/mds/*/stats
loadgen
The Load Generator (loadgen) is a test program designed to simulate large numbers
of Lustre clients connecting and writing to an OST. The loadgen utility is located at
lustre/utils/loadgen (in a build directory) or at /usr/sbin/loadgen (from an
RPM).
Usage
The loadgen utility can be run locally on the OST server machine or remotely from
any LNET host. The device command can take an optional NID as a parameter; if
unspecified, the first local NID found is used.
# cd lustre/utils/
# insmod ../obdecho/obdecho.ko
# ./loadgen
loadgen> h
This is a test program used to simulate large numbers of clients. The
echo obds are used, so the obdecho module must be loaded.
The loadgen utility prints periodic status messages; message output can be
controlled with the verbose command.
To ensure that a file can be written to (a write cache requirement), OSTs reserve
("grant") chunks of space for each newly created file. A grant may cause an OST to
report that it is out of space, even though there is enough space on the disk, because
the space is "reserved" by other files. The loadgen utility estimates the number of
simultaneous open files as disk size divided by grant size and reports that number
when write tests start.
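The estimate is a plain integer division. The sketch below works through it with
made-up sizes; the 100 GiB disk and the 2 MiB grant are assumptions chosen for
easy arithmetic, not values taken from loadgen or from any particular OST.

```shell
#!/bin/bash
# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
disk_bytes=$(( 100 * 1024 * 1024 * 1024 ))   # 100 GiB OST
grant_bytes=$(( 2 * 1024 * 1024 ))           # 2 MiB reserved per new file

# Estimated number of simultaneously open files: disk size / grant size.
max_open_files=$(( disk_bytes / grant_bytes ))
echo "estimated simultaneous open files: $max_open_files"
```

With these numbers the estimate is 51200 files, after which the OST could report
that it is out of space even if little data has actually been written.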
The loadgen utility can start an echo server. On another node, loadgen can specify
the echo server as the device, thus creating a network-only test environment.
loadgen> echosrv
loadgen> dl
0 UP obdecho echosrv echosrv 3
1 UP ost OSS OSS 3
On another node:
Scripting
The threads all perform their actions in non-blocking mode; use the wait command
to block for the idle state. For example:
#!/bin/bash
./loadgen << EOF
device lustre-OST0000
st 1
wr 1 10
wait
quit
EOF
The loadgen utility is intended to grow into a more comprehensive test tool; feature
requests are encouraged. The current feature requests include:
■ Locking simulation
■ Many (echo) clients cache locks for the specified resource at the same time.
■ Many (echo) clients enqueue locks for the specified resource simultaneously.
■ obdsurvey functionality
■ Fold the Lustre I/O kit’s obdsurvey script functionality into loadgen
llog_reader
The llog_reader utility translates a Lustre configuration log into human-readable
form.
Synopsis
llog_reader filename
Description
llog_reader parses the binary format of Lustre’s on-disk configuration logs. It can
only read the logs. Use tunefs.lustre to write to them.
To examine a log file on a stopped Lustre server, mount its backing file system as
ldiskfs, then use llog_reader to dump the log file’s contents. For example:
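One way this could look is sketched below. The device, mount point and log name
are hypothetical; configuration logs are assumed here to live under the CONFIGS
directory of the backing file system. The function prints the commands rather than
running them, since they require root access and a stopped target.

```shell
#!/bin/bash
# Hypothetical device, mount point and log name; substitute your own.
DEV=/dev/sda
MNT=/mnt/mgs
LOG=CONFIGS/lustre-client

# Print the command sequence for a stopped server:
# mount the backing file system as ldiskfs, dump the log, unmount again.
stopped_server_cmds() {
    echo "mount -t ldiskfs $DEV $MNT"
    echo "llog_reader $MNT/$LOG"
    echo "umount $MNT"
}

stopped_server_cmds
```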
To examine the same log file on a running Lustre server, use the ldiskfs-enabled
debugfs utility (called debug.ldiskfs on some distributions) to extract the file. For
example:
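A sketch of the extraction step follows, again with hypothetical device, log and
output names. It relies on the debugfs -c (open read-only in catastrophic mode)
and -R (run a single request) options and on the debugfs dump request. The
function prints the commands for review instead of executing them.

```shell
#!/bin/bash
# Hypothetical device, log name and output path; substitute your own.
DEV=/dev/sda
LOG=CONFIGS/lustre-client
OUT=/tmp/lustre-client

# Print the command sequence for a running server:
# copy the log out with debugfs, then read the copy with llog_reader.
running_server_cmds() {
    echo "debugfs -c -R 'dump $LOG $OUT' $DEV"
    echo "llog_reader $OUT"
}

running_server_cmds
```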
Caution – Although they are stored in the CONFIGS directory, mountdata files do
not use the config log format and will confuse llog_reader.
See Also
The following utilities are part of the Lustre I/O kit. For more information, see
Chapter 24: Benchmarking Lustre Performance (Lustre I/O Kit).
sgpdd_survey
The sgpdd_survey utility tests 'bare metal' performance, bypassing as much of the
kernel as possible. The sgpdd_survey tool does not require Lustre, but it does
require the sgp_dd package.
obdfilter_survey
The obdfilter_survey utility is a shell script that tests performance of isolated
OSTs, the network via echo clients, and an end-to-end test.
ior-survey
The ior-survey utility is a script used to run the IOR benchmark. Lustre includes
IOR version 2.8.6.
ost_survey
The ost_survey utility is an OST performance survey that tests client-to-disk
performance of the individual OSTs in a Lustre file system.
stats-collect
The stats-collect utility contains scripts used to collect application profiling
information from Lustre clients and servers.
By default, the flock utility is disabled on Lustre. Two modes are available.
local mode In this mode, locks are coherent on one node (a single-node flock), but not
across all clients. To enable it, use -o localflock.
This is a client-mount option.
NOTE: This mode does not impact performance and is appropriate for
single-node databases.
consistent mode In this mode, locks are coherent across all clients.
To enable it, use -o flock. This is a client-mount option.
CAUTION: This mode affects the performance of the file being flocked
and may affect stability, depending on the Lustre version used. Consider
using a newer Lustre version which is more stable. If the consistent mode
is enabled and no applications are using flock, then it has no effect.
A call to use flock may be blocked if another process is holding an incompatible lock.
Locks created using flock are applicable for an open file table entry. Therefore, a
single process may hold only one type of lock (shared or exclusive) on a single file.
Subsequent flock calls on a file that is already locked convert the existing lock to the
new lock mode.
Example:
$ mount -t lustre -o flock mds@tcp0:/lustre /mnt/client
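The lock semantics described above can be observed with the flock(1) command from
util-linux. The sketch below is an illustration only: it uses a temporary file on
whatever file system mktemp picks, so it does not exercise Lustre itself, and the
file descriptor numbers are arbitrary. On a Lustre client, the same sequence would
require one of the flock mount options.

```shell
#!/bin/bash
lockfile=$(mktemp)

# Open the file once; the lock belongs to this open file table entry (fd 9).
exec 9>"$lockfile"
flock -s 9          # take a shared lock
flock -x 9          # a second call converts it to an exclusive lock

# A separate open of the same file gets its own entry, so a non-blocking
# exclusive request on it fails while fd 9 holds the exclusive lock.
if ( flock -n -x 8 ) 8>>"$lockfile"; then second="acquired"; else second="blocked"; fi
echo "second open while lock is held: $second"

flock -u 9          # release the lock held through fd 9
if ( flock -n -x 8 ) 8>>"$lockfile"; then after="acquired"; else after="blocked"; fi
echo "second open after unlock: $after"

exec 9>&-
rm -f "$lockfile"
```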
A
ACL Access Control List - An extended attribute associated with a file which
contains authorization directives.
Administrative OST failure A configuration directive given to a cluster to declare that
an OST has failed, so errors can be immediately returned.
C
CMD Clustered metadata, a collection of metadata targets implementing a single
file system namespace.
Completion Callback An RPC made by an OST or MDT to another system, usually a client, to
indicate that the lock request is now granted.
Configlog An llog file used in a node, or retrieved from a management server over the
network with configuration instructions for Lustre systems at startup time.
Configuration Lock A lock held by every node in the cluster to control configuration changes.
When callbacks are received, the nodes quiesce their traffic, cancel the lock
and await configuration changes after which they reacquire the lock before
resuming normal operation.
D
Default stripe pattern Information in the LOV descriptor that describes the default stripe count
used for new files in a file system. This can be amended by using a directory
stripe descriptor or a per-file stripe descriptor.
Direct I/O A mechanism which can be used during read and write system calls. It
bypasses the kernel I/O cache, avoiding memory copies of data between kernel and
application memory address spaces.
Directory stripe descriptor An extended attribute that describes the default stripe
pattern for files underneath that directory.
E
EA Extended Attribute. A small amount of data which can be retrieved through
a name associated with a particular inode. Lustre uses EAs to store striping
information (location of file data on OSTs). Examples of extended attributes
are ACLs, striping information, and crypto keys.
Eviction The process of eliminating server state for a client that is not returning to the
cluster after a timeout or if server failures have occurred.
Export The state held by a server for a client that is sufficient to transparently
recover all in-flight operations when a single failure occurs.
Extent Lock A lock used by the OSC to protect an extent in a storage object for
concurrent control of read/write, file size acquisition and truncation
operations.
F
Failback The failover process in which the default active server regains control over
the service.
Failout OST An OST which is not expected to recover if it fails to answer client requests.
A failout OST can be administratively failed, thereby enabling clients to
return errors when accessing data on the failed OST without making
additional network requests.
FID Lustre File Identifier. A collection of integers which uniquely identify a file
or object. The FID structure contains a sequence, identity and version
number.
Fileset A group of files that are defined through a directory that represents a file
system’s start point.
FLDB FID Location Database. This database maps a sequence of FIDs to a server
which is managing the objects in the sequence.
Flight Group A group of I/O transfer operations initiated in the OSC that are
simultaneously in flight between two endpoints. Tuning the flight group size
correctly leads to a full pipe.
G
Glimpse callback An RPC made by an OST or MDT to another system, usually a client, to
indicate that an extent lock it is holding should be surrendered if it is not
in use. If the system is using the lock, then the system should report the
object size in the reply to the glimpse callback. Glimpses are introduced to
optimize the acquisition of file sizes.
Group Lock
Group upcall
I
Import The state held by a client to fully recover a transaction sequence after a
server failure and restart.
Intent Lock A special locking operation introduced by Lustre into the Linux kernel. An
intent lock combines a request for a lock, with the full information to
perform the operation(s) for which the lock was requested. This offers the
server the option of granting the lock or performing the operation and
informing the client of the operation result without granting a lock. The use
of intent locks enables metadata operations (even complicated ones), to be
implemented with a single RPC from the client to the server.
IOV I/O vector. A buffer destined for transport across the network which
contains a collection (also known as a vector) of blocks with data.
K
Kerberos An authentication mechanism, optionally available in an upcoming Lustre
version as a GSS backend.
L
LBUG A bug that Lustre writes into a log indicating a serious system failure.
lfs The Lustre File System configuration tool for end users to set/check file
striping, etc. See Section 32.1, “lfs” on page 32-2.
lfsck Lustre File System Check. A distributed version of a disk file system checker.
Normally, lfsck does not need to be run, except when file systems are
damaged through multiple disk failures and other means that cannot be
recovered using file system journal recovery.
liblustre Lustre library. A user-mode Lustre client linked into a user program for
Lustre fs access. liblustre clients cache no data, do not need to give back
locks on time, and can recover safely from an eviction. They should not
participate in recovery.
Llite Lustre lite. This term is in use inside the code and module names to indicate
that code elements are related to the Lustre file system.
Llog Lustre log. A log of entries used internally by Lustre. An llog is suitable for
rapid transactional appends of records and cheap cancellation of records
through a bitmap.
Llog Catalog Lustre log catalog. An llog with records that each point at an llog. Catalogs
were introduced to give llogs almost infinite size. llogs have an originator
which writes records and a replicator which cancels records (usually through
an RPC), when the records are not needed.
LMV Logical Metadata Volume. A driver that hides from the Lustre client the fact that
it is working with a metadata cluster instead of a single metadata server.
LND Lustre Network Driver. A code module that enables LNET support over a
particular transport, such as TCP and various kinds of InfiniBand.
Load-balancing MDSs A cluster of MDSs that perform load balancing of system requests.
Lock Client A module that makes lock RPCs to a lock server and handles revocations
from the server.
Lock Server A system that manages locks on certain objects. It also issues lock callback
requests while servicing lock requests for objects that are already locked, and
completes lock requests.
LOV Logical Object Volume. The object storage analog of a logical volume in a
block device volume management system, such as LVM or EVMS. The LOV
is primarily used to present a collection of OSTs as a single device to the
MDT and client file system drivers.
LOV descriptor A set of configuration directives which describes which nodes are OSS
systems in the Lustre cluster, providing names for their OSTs.
Lustre The name of the project chosen by Peter Braam in 1999 for an object-based
storage architecture. Now the name is commonly associated with the Lustre
file system.
Lustre file A file in the Lustre file system. The implementation of a Lustre file is
through an inode on a metadata server which contains references to a
storage object on OSSs.
Lustre lite A preliminary version of Lustre developed for LLNL in 2002. With the
release of Lustre 1.0 in late 2003, Lustre Lite became obsolete.
Lvfs A library that provides an interface between Lustre OSD and MDD drivers
and file systems; this avoids introducing file system-specific abstractions into
the OSD and MDD drivers.
M
Mballoc Multi-Block-Allocate. Lustre functionality that enables the ldiskfs file system
to allocate multiple blocks with a single request to the block allocator.
Normally, an ldiskfs file system allocates only one block per request.
MDC MetaData Client - Lustre client component that sends metadata requests via
RPC over LNET to the Metadata Target (MDT).
MDD MetaData Disk Device - Lustre server component that interfaces with the
underlying Object Storage Device to manage the Lustre file system
namespace (directories, file ownership, attributes).
MDS MetaData Server - Server node that is hosting the Metadata Target (MDT).
MDT Metadata Target. A metadata device made available through the Lustre
meta-data network protocol.
Metadata Write-back Cache A cache of metadata updates (mkdir, create, setattr, other
operations) which an application has performed, but which have not yet been flushed
to a storage device or server.
Mountconf The Lustre configuration protocol (introduced in version 1.6) which formats
disk file systems on servers with the mkfs.lustre program, and prepares
them for automatic incorporation into a Lustre cluster.
N
NAL An older, obsolete term for LND.
NID Network Identifier. Encodes the type, network number and network address
of a network interface on a node for use by Lustre.
NIO API A subset of the LNET RPC module that implements a library for sending
large network requests, moving buffers with RDMA.
O
OBD Object Device. The base class of layering software constructs that provides
Lustre functionality.
OBD type Module that can implement the Lustre object or metadata APIs. Examples of
OBD types include the LOV, OSC and OSD.
opencache A cache of open file handles. This is a performance enhancement for NFS.
Orphan objects Storage objects for which there is no Lustre file pointing at them. Orphan
objects can arise from crashes and are automatically removed by an llog
recovery. When a client deletes a file, the MDT gives back a cookie for each
stripe. The client then sends the cookie and directs the OST to delete the
stripe. Finally, the OST sends the cookie back to the MDT to cancel it.
Orphan handling A component of the metadata service which allows for recovery of open,
unlinked files after a server crash. The implementation of this feature retains
open, unlinked files as orphan objects until it is determined that no clients
are using them.
OSC Object Storage Client. The client unit talking to an OST (via an OSS).
OSD Object Storage Device. A generic, industry term for storage devices with
a more extended interface than block-oriented devices, such as disks. Lustre
uses this name to describe a software module that implements an object
storage API in the kernel. Lustre also uses this name to refer to an instance
of an object storage device created by that driver. The OSD device is layered
on a file system, with methods that mimic create, destroy and I/O
operations on file inodes.
OSS Object Storage Server. A server OBD that provides access to local OSTs.
OST Object Storage Target. An OSD made accessible through a network protocol.
Typically, an OST is associated with a unique OSD which, in turn, is
associated with a formatted disk file system on the server containing the
storage objects.
P
Pdirops A locking protocol introduced in the VFS by CFS to allow for concurrent
operations on a single directory inode.
pool OST pools allow the administrator to associate a name with an arbitrary
subset of OSTs in a Lustre cluster. A group of OSTs can be combined into a
named pool with unique access permissions and stripe characteristics.
Portal A concept used by LNET. LNET messages are sent to a portal on a NID.
Portals can receive packets when a memory descriptor is attached to the
portal. Portals are implemented as integers.
PTLRPC An RPC protocol layered on LNET. This protocol deals with stateful servers
and has exactly-once semantics and built in support for recovery.
R
Recovery The process that re-establishes the connection state when a client that was
previously connected to a server reconnects after the server restarts.
Replay The concept of re-executing a server request after the server lost information
in its memory caches and shut down. The replay requests are retained by
clients until the server(s) have confirmed that the data is persistent on disk.
Only requests for which a client has received a reply are replayed.
Re-sent request A request that has seen no reply can be re-sent after a server reboot.
Revocation Callback An RPC made by an OST or MDT to another system, usually a client, to
revoke a granted lock.
Rollback The concept that server state is lost in a crash because it was cached in
memory and not yet persistent on disk.
Root squash A mechanism whereby the identity of a root user on a client system is
mapped to a different identity on the server to avoid root users on clients
gaining broad permissions on servers. Typically, for management purposes,
at least one client system should not be subject to root squash.
S
Storage Object API The API that manipulates storage objects. This API is richer than that of
block devices and includes the create/delete of storage objects, read/write
of buffers from and to certain offsets, set attributes and other storage object
metadata.
Stripe count The number of OSTs holding objects for a RAID0-striped Lustre file.
Striping metadata The extended attribute associated with a file that describes how its data is
distributed over storage objects. See also default stripe pattern.
T
T10 object protocol An object storage protocol tied to the SCSI transport layer. Lustre does not
use T10.
W
Wide striping Strategy of using many OSTs to store stripes of a single file. This obtains
maximum bandwidth to a single file through parallel utilization of many
OSTs.
Glossary-10 Lustre 2.0 Operations Manual • January 2011
Index
troubleshooting with strace, 28-11 I/O, direct, performing, 19-10
directory statahead, using, 31-18 inode number, OST, 5-8
downed routers, 15-2 inode size, MDT, 5-8
installing Lustre, environmental requirements, 8-5
E installing Lustre, HA software, 8-4
e2fsprogs, 8-4 installing Lustre, memory requirements, 5-11
e2scan, 36-2 installing Lustre, required tools / utilities, 8-4
Elan (Quadrics Elan), 2-3 interoperability, 16-2
environmental requirements, 8-5 interpreting
error messages, 26-3 adaptive timeouts, 31-7
error numbers, 26-2
external journal, creating, 6-5 K
key features, 1-3
F
failover, 3-2 L
capabilities, 3-2 l_getidentity, 36-3, 36-17, 36-20
configuration types, 3-3 lctl, 36-4
configuring, 13-12 setting parameters, 13-9
failover and MMP, 20-2 lfs command, 32-2
MDT (active/passive), 3-5
lfsck command, 32-15
OST (active/active), 3-5
ll_recover_lost_found_objs, 36-12
file formats, quotas, 21-12
llapi, 34-16
file readahead, using, 31-18
LNET
file size, maximum, 5-10
starting, 15-3
file striping, 18-2 stopping, 15-4
file system size, maximum, 5-10 LNET self-test
filefrag command, 32-17, 33-4 commands, 23-7
filename length, maximum, 5-11 Load balancing with InfiniBand
flock utility, 36-48 modprobe.conf, 15-5
free space management locking proc entries, 31-26
adjusting weighting between free space and lst, 36-21
location, 18-11 Lustre
round-robin allocator, 18-10 administration, aborting recovery, 14-12
weighted allocator, 18-11 administration, changing a server NID, 14-5
free space, managing, 18-9 administration, determining which machine is
serving an OST, 14-13
H administration, failout / failover mode for
HA software, 8-4 OSTs, 13-5
handling full OSTs, 19-2 administration, finding nodes in the file
handling timeouts, 32-19 system, 14-2
administration, mounting a server, 13-3
administration, mounting a server without
I
Lustre service, 14-3
I/O options
administration, regenerating Lustre
checksums, 19-10
configuration logs, 14-4
I/O tunables, 31-11 administration, removing and restoring
failover, 3-5 ost_survey tool, 24-13
inode size, 5-8 OSTs
memory requirements, 5-11 adding, 10-10
metadata replay, 30-6 OSTs and MDTs, maximum, 5-9
minimum OSTs, full, handling, 19-2
stripe size, 5-9
mkfs.lustre, 36-28 P
setting parameters, 13-9 parameters, setting with lctl, 13-9
MMP parameters, setting with mkfs.lustre, 13-9
MMP and failover, 20-2 parameters, setting with tunefs.lustre, 13-9
modprobe.conf, 15-5 pathname length, maximum, 5-11
monitoring performing direct I/O, 19-10
changelogs, 12-2 Perl, 8-4
CollectL, 12-8
plot-llstat, 36-35
Lustre Monitoring Tool, 12-7
pools, OST, 19-6
mount command, 32-19
Portals LND
mount.lustre, 36-32
Linux, 35-10
MX LND, 35-12
proc entries
free space distribution, 31-10
N LNET information, 31-8
network locating file systems and servers, 31-2
bonding, 7-2 locking, 31-26
networks, supported timeouts, 31-3
Elan (Quadrics Elan), 2-3
ra (RapidArray), 2-3 Q
NID, server, changing, 14-5 Quadrics Elan, 2-3
number of clients, maximum, 5-9 quota limits, 21-12
number of files in a directory, maximum, 5-10 quota statistics, 21-14
number of open files, maximum, 5-11 quotas
administering, 21-5
O allocating, 21-8
obdfilter_survey tool, 24-6 creating files, 21-5
orphaned objects, working with, 27-8 enabling, 21-3
OSS file formats, 21-12
memory, determining, 5-13 granted cache, 21-11
service thread count, 25-3 known issues, 21-11
limits, 21-12
OSS read cache, 31-19
statistics, 21-14
OST
working with, 21-2
failover, 3-5
number of inodes, 5-8
removing and restoring, 14-8
R
ra (RapidArray), 2-3
OST block I/O stream, watching, 31-17
RAID
OST pools, 19-6
creating an external journal, 6-5
OST, adding, 14-7 formatting options, 6-3
OST, determining which machine is serving, 14-13 handling degraded arrays, 13-6
directory statahead, 31-19
file readahead, 31-18
lockless tunables, 25-5
MDS threads, 25-3
OSS threads, 25-3
root squash, 22-6
U
upgrade
1.6.x to 1.8.x, 16-2
complete file system, 16-3
utilities, third-party
e2fsprogs, 8-4
Perl, 8-4
V
VBR, introduction, 30-13
VBR, tips, 30-14
VBR, working with, 30-14
Version-based recovery (VBR), 30-13
W
weighted allocator, 18-11
weighting, adjusting between free space and
location, 18-11