On Incremental File System Development - Stony Brook University

Transcription

On Incremental File System DevelopmentEREZ ZADOK, RAKESH IYER, NIKOLAI JOUKOV, GOPALAN SIVATHANU, ANDCHARLES P. WRIGHTDeveloping file systems from scratch is difficult and error prone. Layered, or stackable, filesystems are a powerful technique to incrementally extend the functionality of existing file systemson commodity OSes at runtime. In this paper, we analyze the evolution of layering from historicalmodels to what is found in four different present day commodity OSes: Solaris, FreeBSD, Linux,and Microsoft Windows. We classify layered file systems into five types based on their functionalityand identify the requirements that each class imposes on the OS. We then present five major designissues that we encountered during our experience of developing over twenty layered file systems onfour OSes. We discuss how we have addressed each of these issues on current OSes, and presentinsights into useful OS and VFS features that would provide future developers more versatilesolutions for incremental file system development.Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—Portability; D.2.13 [Software Engineering]: Reusable Software—Reuse models; D.4.3 [Operating Systems]: File Systems Management—File organization; D.4.7 [Operating Systems]: Organization and DesignGeneral Terms: DesignAdditional Key Words and Phrases: Layered File Systems, Stackable File Systems, VFS, Vnode,I/O Manager, IRP, Extensibility1. INTRODUCTIONData management is a fundamental facility provided by the operating system (OS). Filesystems are tasked with the bulk of data management, including storing data on disk (orover the network) and naming (i.e., translating a user-visible name such as /usr/srcinto an on-disk object). File systems are complex, and it is difficult to enhance them.Furthermore, OS vendors are reluctant to make major changes to a file system, becausefile system bugs have the potential to corrupt all data on a machine. Because file systemdevelopment is so difficult, extending file system functionality in an incremental manneris valuable. Incremental development also makes it possible for a third-party softwaredeveloper to release file system improvements, without developing a whole file systemfrom scratch.Originally, file systems were thoroughly integrated into the OS, and system calls directly invoked file system methods. This architecture made it difficult to add multiple filesystems. The introduction of a virtual node or vnode provided a layer of abstraction thatseparates the core of the OS from file systems [Kleiman 1986]. Each file is represented inPermission to make digital/hard copy of all or part of this material without fee for personal or classroom useprovided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/servernotice, the title of the publication, and its date appear, and notice is given that copying is by permission of theACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specificpermission and/or a fee.c 2006 ACM 1533-3077/2006/0000-0001 5.00ACM Transactions on Storage, Vol. 2, No. 2, May 2006, Pages 1–33.

2·Zadok et al.memory by a vnode. A vnode has an operations vector that defines several operations thatthe OS can call, thereby allowing the OS to add and remove types of file systems at runtime. Most current OSes use something similar to the vnode interface, and the number offile systems supported by the OS has grown accordingly. For example, Linux 2.6 supportsover 30 file systems and many more are maintained outside of the official kernel tree.Clearly defining the interface between the OS and file systems makes interposition possible. A layered, or stackable, file system creates a vnode with its own operations vectorto be interposed on another vnode. Each time one of the layered file system’s operationsis invoked, the layered file system maps its own vnode to a lower-level vnode, and thencalls the lower-level vnode’s operation. To add functionality, the layered file system canperform additional operations before or after the lower-level operation (e.g., encryptingdata before a write or decrypting data after a read). The key advantage of layered filesystems is that they can change the functionality of a commodity OS at runtime so hard-todevelop lower-level file systems do not need to be changed. This is important, because OSdevelopers often resist change, especially to file systems where bugs can cause data loss.Rosenthal was among the first to propose layering as a method of extending file systems [Rosenthal 1990; 1992]. To enable layering, Rosenthal radically changed the VFS internals of SunOS. Each public vnode field was converted into a method; and all knowledgeof vnode types (e.g., directory vs. regular file) was removed from the core OS. Researchersat UCLA independently developed another layering infrastructure [Heidemann and Popek1991; 1994] that placed an emphasis on light-weight layers and extensibility. The originalpioneers of layering envisioned creating building blocks that could be composed togetherto create more sophisticated and rich file systems. For example, the directory-name lookupcache (DNLC) could simply be implemented as a file system layer, which returns resultson a cache hit, but passes operations down on a miss [Skinner and Wong 1993].Layering has not commonly been used to create and compose building-block file systems, but instead has been widely used to add functionality rapidly and portably to existingfile systems. Many applications of layered file system are features that could be implemented as part of the VFS (e.g., unification), but for practical reasons it is easier to developthem as layered file systems. Several OSes have been designed to support layered file systems, including Solaris, FreeBSD, and Windows. Several layered file systems are availablefor Linux, even though it was not originally designed to support them. Many users use layered file systems unknowingly as part of Antivirus solutions [Symantec 2004; Miretskiyet al. 2004], and Windows XP’s system restore feature [Harder 2001]. On Unix, a nulllayer file system is used to provide support for accessing one directory through multiplepaths. When the layer additionally modifies the data, useful new functionality like encryption [Corner and Noble 2002; Halcrow 2004] or compression [Zadok et al. 2001] can beadded. Another class of layered file systems, called fan out, operates directly on top ofseveral lower-level file systems. For example, unification file systems merge the contentsof several directories [Pendry and McKusick 1995; Wright et al. 2006]. Fanout file systemscan also be used for replication, load-balancing, failover, snapshotting, and caching.The authors of this paper have over fifteen years of combined experience developinglayered file systems on four OSes: Solaris, FreeBSD, Linux, and Windows. We havedeveloped more than twenty layered file systems that provide encryption, compression,versioning, tracing, antivirus, unification, snapshotting, replication, checksumming, andmore.ACM Transactions on Storage, Vol. 2, No. 2, May 2006.

On Incremental File System Development·3The rest of this paper is organized as follows. In Section 2 we survey alternative techniques to enhance file system functionality. In Section 3 we describe four models of layeredfile system development. We then proceed to describe five broad classes of layered file systems in Section 4. In Section 5 we describe five general problems and their solutions thatare useful for all types of layered file systems. We conclude in Section 6 with guidingprinciples for future OS and layered file system developers.2. RELATED WORKIn this section we describe alternatives to achieve the extensibility offered by layered filesystems. We discuss four classes of related works based on the level at which extensibilityis achieved: in hardware, in the device driver, at the system-call level, or in user-levelprograms. We have a detailed discussion of layered file system infrastructures in Section 3.Hardware level. Slice [Anderson et al. 2000] is a storage system architecture for highspeed networks with network-attached block storage. Slice interposes a piece of codecalled a switching filter in the network hardware to route packets among a group of servers.Slice appears to the upper level as a single block-oriented storage device. High-level control information (e.g., files) is unavailable to interposition code at the hardware level, andtherefore cannot perform optimizations for specific devices.Semantically-Smart Disk Systems (SDSs) [Sivathanu et al. 2003] attempt to providefile-system–like functionality without modifying the file system. Knowledge of a specificfile system is embedded into the storage device, and the device provides additional functionality that would traditionally be implemented in the file system. Such systems arerelatively easy to deploy, because they do not require modifications to existing file systemcode. Layered file systems share a similar goal in terms of reusing and leveraging existinginfrastructures. Unlike a layered file system, an SDS is closely tied to the format of the filesystem running on top of it, so porting SDSs to new file systems is difficult.Device-driver level. Software RAID and Logical Volume Managers (LVMs) introduceanother layer of abstraction between the storage hardware and the file system. They provideadditional features such as increased reliability and performance, while appearing to thefile system as a simple disk, which makes them easy to deploy in existing infrastructure.For example, on Linux a Cryptoloop devices uses a loopback block driver to encrypt datastored on a disk or in a file. A new file system is then created within the Cryptoloop device.Any file system can run on top of a block device-driver extension. However, block deviceextensions cannot exploit the control information (e.g., names) that is available at the filesystem level.System-call level. SLIC [Ghormley et al. 1998] is a protected extensibility system forOSes that uses interposition techniques to enable the addition of a large class of untrustedextensions to existing code. Several OS extensions can be implemented using SLIC suchas encryption file systems and a protected environment for binary execution. The Interposition Agents toolkit [Jones 1993], developed by Microsoft Research, allows a user’scode to be written in terms of high-level objects provided by this interface. The toolkitwas designed to ease interposing code between the clients and the instances of the systeminterface to facilitate adding new functionality like file reference tracing, customizable filesystem views, etc. to existing systems. Similarly, Mediating Connectors [Balzer and Goldman 1999] is a system call (and library call) wrapper mechanism for Windows NT thatACM Transactions on Storage, Vol. 2, No. 2, May 2006.

4·Zadok et al.allows users to trap API calls.System call interposition techniques rely on the communication channel between userspace and the kernel, and hence cannot handle operations that bypass that channel (e.g.,mmap operations and their associated page faults). Also, interposing at the system calllevel results in overhead for all system calls even if only a subset of kernel components(e.g., the file system) need to be interposed.User level. Gray-box Information and Control Layers (ICL) [Arpaci-Dusseau and ArpaciDusseau 2001] extend the functionality of OSes by acquiring information about their internal state. ICLs provide OS-like functionality without modifying existing OS code.Dust [Burnett et al. 2002] is a direct application of ICLs that uses gray-box knowledgeof the OS’s buffer cache management policy to optimize application performance. For example, if a Web server first services Web pages that are believed to be in the OS buffercache, then both average response time and throughput can be improved.Blaze’s CFS is a cryptographic file system that is implemented as a user-level NFSserver [Blaze 1993]. The OS’s unmodified NFS client mounts the NFS server over theloopback network interface. The SFS toolkit [Maziéres 2001] aims to simplify Unix filesystem extensibility by allowing development of file systems at the user level. Using thetoolkit, one can implement a simple user-level NFS server and redirect local file systemoperations into the user level implementation. The popularity of the SFS toolkit demonstrates that developers have observed the complexity of modifying existing time-tested filesystems. SiRiUS [Goh et al. 2003], a file system for securing remote untrusted storage,and Dabek’s CFS [Dabek et al. 2001], a wide area cooperative file system, were built usingthe SFS toolkit.Filesystem in Userspace [Szeredi 2005], or FUSE, is a hybrid approach that consistsof two parts: (1) a standard kernel-level file system which passes calls to a user-leveldemon, and (2) a library to easily develop file-system–specific FUSE demons. Developingnew file systems with FUSE is relatively simple because the user-level demon can issuenormal system calls (e.g., read) to service a VFS call (e.g., vfs read). The main twodisadvantages of a FUSE file system are that (1) performance is limited by crossing theuser-kernel boundary, and (2) the file system can only use FUSE’s API, which closelymatches the VFS API, whereas kernel file systems may access a richer kernel API (e.g.,for process and memory management).Sun designed and developed Spring as an object-oriented microkernel OS. Spring’sarchitecture allowed various components, including file systems, to be transparently extended with user libraries [Khalidi and Nelson 1993]. Spring’s design was radically different from current commodity OSes. As it was research prototype, it was not deployedin real systems. K42 [Appavoo et al. 2002] is a new OS under development at IBMwhich incorporates innovative mechanisms and modern programming technologies. Various system functionalities can be extended at the user level through libraries by providinga microkernel-like interface. The Exokernel architecture [Engler et al. 1995] implementsminimal components in the kernel and allows user-level applications to customize OS functionality using library OSes.In general, user-level extensions are easy to implement, but their performance is not asgood as kernel extensions because the former involve data copies between the user leveland the kernel level, as well as additional context switches.ACM Transactions on Storage, Vol. 2, No. 2, May 2006.

On Incremental File System Development·53. LAYERING MODELSOn Unix, the idea of layering evolved from the vnode interface [Kleiman 1986], initiallyimplemented in SunOS in 1984. Most of today’s layered file system models use a vnodeinterface. Microsoft Windows, on the other hand, uses a message-passing layering model.We describe three Unix models and the Windows model here; these cover the full rangeof popular or commodity models available. In Section 3.1 we describe Rosenthal’s objectoriented model. In Section 3.2 we describe the model contemporaneously developed atUCLA. In Section 3.3 we describe the layering model found in three modern Unix systems(Linux, FreeBSD, and Solaris). In Section 3.4 we describe the Windows 2000 and Windows XP message-passing layering model. We present a summary table of the featuresoffered by each layering model in Section 3.5.3.1 Rosenthal’s Layering ModelIn 1990, Rosenthal developed an experimental prototype of a new VFS for SunOS [Rosenthal 1990] with the goal of layering vnodes, so that file systems can be composed of building blocks. Rosenthal identified two distinct types of layering. The first, shown in Figure 1(a), is interposition in which a higher-level vnode is called before the lower-level vnode, and can modify the lower-level vnode’s arguments, operation, and results. Today, thisis commonly called linear layering or linear stacking. The second, shown in Figure 1(b)is composition in which a higher-level vnode performs operations on several lower-levelvnodes. Today, this is commonly called fan out.UserUserVnode AVnode AVnode B(a) LinearVnode BUserVnode AVnode C(b) Fan outVnode B(c) Fan inFig. 1. Types of layering. In a linear layer all operations are intercepted by the top-level vnode, A, and A passesit to a single vnode below. In fan-out, all operations go through the top-level vnode, A, but there are two or morelower-level vnodes. In fan-in, some operations go through the top-level vnode, A, before going to the lower-levelvnode, B, but some operations can bypass A and directly access B.Rosenthal introduced two key abstractions to support vnode layering: push and pop forinserting and removing vnodes from a stack. All vnode operations go through the highestlevel vnode. Rosenthal replaced all visible attributes in the vnode structure with methodsso that layered file systems could intercept them. Push inserts a vnode at the top of a stackof vnodes, so that all operations destined for the stack of vnodes go to the new top vnode.The top vnode may in turn call the vnodes that are below it. Pop performs the oppositeoperation: the top vnode from a stack is removed and operations go directly to the lowerlevel vnode. Two new fields were added to the vnode structure: v top and v above.ACM Transactions on Storage, Vol. 2, No. 2, May 2006.

6·Zadok et al.The v above pointer points to the vnode that is directly above this one in the stack.The v top pointer points to the highest level vnode in the stack. All vnode operationsgo through the highest-level vnode, hence every vnode in the stack essentially becomesan alias to the highest-level vnode. Unfortunately, this prevents fan-in access (shown inFigure 1(c)), in which a user process accesses the lower-level file system directly. Fanin access is necessary when applications need to access unmodified data; for example,a backup program should write encrypted data (stored on the lower-level file system) totape, not the unencrypted data (accessible through the top layer). Rosenthal also suggestedseveral layered file system building blocks, but did not implement any of them for hisprototype.Rosenthal proposed a set of requirements for a layered vnode interface [Rosenthal 1992].To support interposition, he concluded that the OS must support replacing a vnode’s operations vector and private data for each interposer. Moreover, all vnode operations mustbe mitigated through the operations vector. To support composition, Rosenthal concludedthat vnodes must support transactions. The vnode interfaces assumes that each operationis atomic [Kleiman 1986]. However, to compose two file systems together, two distinctlower-level vnode operations must be called; thus the overall operation is not atomic.Later, a vnode interface for SunOS based on Rosenthal’s interposition and several example file systems were developed [Skinner and Wong 1993]. Of particular note is that someexisting VFS operations took multiple vnode arguments, making it impossible to determinewhich vnode’s operations vector should handle the operation. For example, link wouldtake two vnode arguments, the parent directory and the vnode to insert into that directory.To make each vnode control all of its operations, Skinner and Wong decomposed theselarger vnode operations into smaller operations to manage vnode link counts, to update directory entries, and to create objects. Skinner and Wong’s prototype dealt with issues thatRosenthal did not, including concurrency, locking, and per-mount operations. They alsodeveloped file system components in a layered manner. Skinner and Wong’s implementation of mount and unmount is a particularly elegant example of push and pop. The OS’sname-lookup code no longer needs special code for mount points. The root vnode of thefile system to be mounted is simply pushed on top of the covered vnode. To unmount a filesystem, its root vnode is simply popped from the stack. The directory-name–lookup cachewas also implemented as an interposition layer.3.2 UCLA’s Layering ModelIn the early 1990s, Heidemann and Popek also developed an infrastructure for layered filesystem development at UCLA [Heidemann and Popek 1991; 1994]. The UCLA modelemphasized general VFS extensibility and the ability to compose many small and efficientbuilding block layers into more sophisticated file system services. The UCLA model suggests that each separate service should be its own layer. For example, UFS should bedivided into at least three different layers: (1) managing disk partitions, (2) a file storagelayer providing arbitrary-length files referenced by a unique identifier (i.e., inode number),and (3) a hierarchical directory component. Breaking up a file system like UFS wouldallow interposition at several points. For example, an intrusion-detection layer could beinserted above the directory component so it has access to names, but a compression layercould be inserted between the naming layer and the file storage layer to avoid the implementation details of hard links.To provide extensibility, the UCLA model does not have a fixed set of operations. InACM Transactions on Storage, Vol. 2, No. 2, May 2006.

On Incremental File System Development·7stead, each file system provides its own set of operations, and the total set of operations isthe union of all file systems’ operations. If a file system does not support a given operation, then a generic routine for that file system is called. This routine could, for example,simply return an “Operation not supported” error. However, for a layered file system, ageneric bypass routine can be used. To provide operations with arbitrary arguments, theUCLA interface transforms all of an operation’s arguments into a single structure of arguments. Along with the structure, meta-data is passed to the operation that identifies eachargument’s offset and type. Using this meta-data, the bypass routine can generically mapits own vnodes to lower-level vnodes and invoke the lower-level operation.The UCLA model uses mount and unmount to instantiate a layered file system. Theexisting mount operation fits well for two reasons: (1) it creates a new subtree of objectswithin the file system name space, and (2) creation of subtrees usually requires an existingfile system object (e.g., to mount a UFS file system you need a device node). A file-systemlayer often accesses an existing directory in much the same way that UFS uses a device.For example, to mount an encryption file system, the path to the encrypted data is used.Caching is essential to provide good performance, but can pose problems if two or morelayers cache an object. For example, if two layers concurrently cache the same page orvnode, and one modifies it, then the other one would read stale data. Similarly, if bothof them modify the page or vnode, then one of them would lose its update. Heidemanndeveloped a prototype general-purpose cache manager that provided coherent access to allfile system layers [Heidemann and Popek 1995]. The UCLA cache manager requires layers to request either shared or exclusive ownership of an object before caching it. Whenrequired, the cache manager revokes the existing layers’ ownership using a callback. Inmost cases, the cache manager automatically determines when two objects represent thesame data. However, if layers may change the semantics of the data, then the cache manager cannot correlate the upper-level and lower-level file’s position. In these cases, thecache manager treats the upper-level and lower-level objects as two distinct objects, andall ownership requests for either object are sent to the layer that changes the semantics.The semantic-changing layer can then invoke the correct callbacks based on its knowledgeof the relationship between the original and transformed data.The Ficus replicated file system and several course projects (e.g., encryption and compression layers) were developed using the UCLA model [Guy et al. 1990].3.3 Layering in Modern Unix SystemsIn this section we describe layering in three modern Unix-based OSes: Solaris, FreeBSD,and Linux. We discuss their pros and cons in terms of ease of development and performance. All three systems use the vnode interface to enable layering functionality, but theyhave key implementation differences that influence development ease and performance.Solaris. The architecture of the Solaris VFS is nearly identical to the classic vnodearchitecture. Each vnode has a fixed operations vector. Each operation must be implemented by every file system (undefined operations are not permitted). Generic operationsfor returning “operation not supported” or in some cases success are available. Mutableattributes such as size, access time, etc. are not part of vnode fields; instead, the vnodeoperations include functions for managing such attributes.The Solaris loopback file system, Lofs [SMCC 1992], is a null pass-through layer thatlayers only on directory vnodes. Thus, for regular files on Lofs, vnode objects are notACM Transactions on Storage, Vol. 2, No. 2, May 2006.

8·Zadok et al.duplicated. Because Lofs does not layer on file vnodes, the data pages of files are notduplicated, resulting in more efficient memory utilization. However, this makes it harderto extend Lofs to provide functionality like encryption, where two different page contentsneed to be maintained at the lower level and the layered level.CacheFS [SunSoft 1994] is a layered fan-out file system in Solaris which can mount ontop of two lower level directories, the source and the cache directory. When files in thesource directory are accessed through CacheFS, they are copied into the root of the cachedirectory and indexed using a numeric identifier.FreeBSD. The FreeBSD vnode interface has extensibility based on the UCLA model.FreeBSD allows dynamic addition of vnode operations. While activating a file system,the kernel registers the set of vnode operations that the file system supports, and builds anoperation vector for each file system that contains the union of all operations supportedby any file system. File systems provide default routines for operations that they do notsupport. FreeBSD uses packed arguments so layers can generically bypass operations asin the UCLA model.FreeBSD’s version of the loopback file system is called nullfs [Pendry and McKusick1995]. It is a simple file system layer and makes no transformations on its arguments, justpassing the requests it receives to the lower file system. FreeBSD’s union mounts [Pendryand McKusick 1995] logically merge the directories that they are mounted on.Compared to Solaris (and Linux), FreeBSD has the most versatile support for layeringat the file system level. Its architecture allows new vnode operations to be created dynamically. Packing function arguments in an argument structure allows layers to bypassoperations that they do not need to intercept easily.Linux. Linux does not have native support for layered file systems. The Linux VFS wasdesigned to make adding file systems relatively simple by separating generic file systemcode from the individual file system implementations. A file system may choose not toimplement a method, and the VFS itself provides a sensible default. For example, a filesystem can use generic routines for most data-oriented operations (i.e., reading from andwriting to files, including via mmap), and only needs to provide two primitive operationsto read and write a block from the file. This trend of extracting more and more abstractionsis on the rise in the new versions of the Linux kernel. Layered file systems are usually notable to use generic methods, because they need to pass the operation to the lower-level filesystem, but generic operations do not make it more difficult to develop layered file systems.On the other hand, providing functionality directly within the VFS makes it more difficultto implement layered file systems. As shown in Figure 2, a layered file system must appearjust like the VFS to the lower-level file system [Zadok and Bădulescu 1999], so it mustreplicate this VFS functionality. Worse, as the VFS changes, the layered file system mustalso change, or subtle bugs could be introduced.On Linux, the vnode object is divided into two components: a dentry which representsthe name of the file and an inode which represents the actual on-disk file. This separationsimplifies support for hard links. On Linux, VFS operation vectors are fixed. Operationscannot be added, and the prototype of existing operations cannot be changed. The LinuxVFS stores mutable values inside the inode object. This makes it more difficult to maintaincache coherency between the inodes of the layered file system and the lower level inodes.It also causes performance degradation in layered file systems as the modified values in theACM Transactions on Storage, Vol. 2, No. 2, May 2006.

On Incremental File System Development·9VFSFile System BehaviorVFS Like BehaviorLayeredFile SystemLower Level File SystemFig. 2. On Unix, layered file systems consist of two halves: the top half must behave like a file system, and thebottom half must behave like the VFS.lower-level objects need to be copied to the upper-level objects after each file operation.The existing VFS API requires modifications to avoid this data redundancy. In particular,the data fields should not be directly accessible from outside a file system. Instead, readsand writes of these private fields should be performed by calling file-system methods. Ourevaluation, described in Appendix A, shows that the overheads of these method calls wouldbe negligible.FiST [Zadok 2001] is a high-level layered file system definition language that we developed. FiST includes Wrapfs [Zadok et al. 1999], a layered vnode interface implementation for Linux, FreeBSD, and Solaris. Wrapfs is a

at UCLA independently developed another layering infrastructure [Heidemann and Popek 1991; 1994] that placed an emphasis on light-weight layers and extensibility. The original pioneers of layering envisioned creating building blocks that could be composed together to create more sophisticated and rich le systems. For example, the directory .