HIERARCHICAL HOST-BASED STORAGE

A method of accessing a memory record in distributed network storage, comprising: storing a plurality of memory records in a plurality of network nodes, each of which stores a file system segment of a file system mapping the memory records, each file system segment mapping a subset of the memory records; receiving, by a storage managing module of a first network node, a request for accessing one of the memory records from an application executed in the first network node; querying a file system segment stored in the first network node for the memory record; when the memory record is missing, querying for an address of a second network node, wherein the memory record is stored in the second network node; and providing said first network node with an access to said memory record at said second network node via a network according to said address.

Description
RELATED APPLICATION

This application claims the benefit of priority under 35 USC 119(e) of U.S. Provisional Patent Application No. 61/946,847 filed on Mar. 2, 2014, the contents of which are incorporated herein by reference in their entirety.

FIELD AND BACKGROUND OF THE INVENTION

The present invention, in some embodiments thereof, relates to a shared file system and, more particularly, but not exclusively, to a shared file system with hierarchical host-based storage.

Direct-attached storage (DAS) is a model in which data is local on a server and benefits from low latency access. However, when multiple servers are connected to a network, the DAS model is: inefficient, because there is no resource sharing between servers; inconvenient, since data cannot be shared between processes running on different application servers; and not resilient, because data is lost upon a single server failure.

To overcome the weaknesses of DAS, the shared storage model was invented. Shared-storage systems store all or most metadata and data on a server, which is typically an over-the-network server and not the same server that runs the application(s) that generate and consume the stored data. This architecture can be seen both in traditional shared storage systems, such as NetApp FAS and/or EMC Isilon, where all of the data is accessed via the network, and in host-based storage, such as Redhat Gluster and/or EMC Scale-io, in which application servers also run storage functions but the data is uniformly distributed across the cluster of servers (so 1/n of the data is accessed locally by each server and the remaining (n−1)/n of the data is accessed via the network).

Another well-known variant of shared storage is shared storage with (typically read) caches. In this design the application server includes local storage media (such as a Flash card) that holds data that was recently accessed by the application server. This is typically beneficial for recurring read requests. Caching can be used in front of a traditional shared storage (for example in the Linux block layer cache (BCache)), or in front of a host-based storage (for example in VMware vSAN). These caching solutions tend to be block-based, i.e. a DAS file system layer on top of a shared block layer.

Finally, some storage protocols, such as the Hadoop distributed file system (HDFS) and the parallel network file system (pNFS), allow metadata to be served from a centralized shared node, while data is served from multiple nodes. The data (not the metadata) is typically uniformly distributed among the nodes for load balancing purposes.

SUMMARY OF THE INVENTION

According to an aspect of some embodiments of the present invention there is provided a method of accessing a memory record in distributed network storage, comprising: storing a plurality of memory records in a plurality of network nodes, each one of the plurality of network nodes storing a plurality of file system segments of a file system mapping the plurality of memory records, each one of the plurality of file system segments maps a subset of the plurality of memory records; receiving, by a storage managing module of a first network node of the plurality of network nodes, a request for accessing one of the plurality of memory records, the request is received from an application executed in the first network node; querying a first file system segment stored in the first network node for the memory record; when the memory record is missing from the first memory records subset, querying for an address of a second network node of the plurality of network nodes, wherein the memory record is stored in a second memory records subset of the second network node; and providing the first network node with an access to the memory record at the second network node via a network according to the address.

Optionally, the providing comprises establishing a direct communication channel between the first network node and the second network node via the network according to the address to provide the access.

Optionally, the querying for the address includes: sending a request to a catalog service via the network; and receiving a reply message from the catalog service, the reply message including the address.

Optionally, the querying for the address includes sending a request to each of the plurality of network nodes to receive the address.

Optionally, the querying for the address includes querying for a last known location of the memory record cached in the first file system segment.

Optionally, the second network node temporarily blocks write access to the memory record for the first network node when the memory record is currently accessed by any other of the plurality of network nodes.

More optionally, the second network node temporarily blocks access to the memory record for the first network node when the memory record is currently written by any other of the plurality of network nodes.

Optionally, a copy of the memory record is also stored in a third of the plurality of network nodes.

Optionally, the method further comprises: when the second network node is unavailable, querying for an address of the third network node; and establishing a direct communication channel between the first network node and the third network node via the network according to the address to provide access to the memory record.

Optionally, a copy of the memory record is also stored in the first network node and may be accessed instead of accessing the memory record at the second network node via the network.

Optionally, the method further comprises, before the querying: querying for an address of a directory containing the memory record; and querying for an address of the memory record in the directory.

Optionally, the memory record includes multiple file segments.

Optionally, the querying for the address includes providing an inode number of the memory record.

Optionally, the querying for the address includes providing a layout number of the memory record.

According to some embodiments of the invention there is provided a computer readable medium comprising computer executable instructions adapted to perform the method.

According to an aspect of some embodiments of the present invention there is provided a system of managing a distributed network storage, comprising: a file system segment stored in a first of a plurality of network nodes, the file system segment is one of a plurality of file system segments of a file system mapping a plurality of memory records; a program store storing a storage managing code; and a processor, coupled to the program store, for implementing the storage managing code, the storage managing code comprising: code to receive an access request to a memory record of the plurality of memory records from an application executed in the first network node; code to query the file system segment for the memory record in the first memory records subset; code to query for an address of a second network node of the plurality of network nodes when the memory record is missing from the first memory records subset, wherein the memory record is stored in a second memory records subset of the second network node; and code to provide the first network node with an access to the memory record at the second network node via a network according to the address.

According to an aspect of some embodiments of the present invention there is provided a distributed network storage system, comprising: a plurality of network nodes connected via a network, each including a storage managing module; a plurality of file system segments of a file system, each stored in one of the plurality of network nodes; a plurality of memory records managed by the plurality of file system segments, wherein each of the plurality of memory records is owned by one of the plurality of network nodes and stored in at least one of the plurality of network nodes; and wherein when an application executed in a first of the plurality of network nodes requests an access to one of the plurality of memory records, and the memory record is missing from a memory records subset stored in the first network node, a storage managing module included in the first network node queries for an address of a second network node of the plurality of network nodes, wherein the memory record is stored in a second memory records subset of the second network node, and provides the first network node with an access to the memory record at the second network node via a network according to the address.

According to an aspect of some embodiments of the present invention there is provided a method of creating a memory record in distributed network storage, comprising: storing a plurality of memory records in a plurality of network nodes, each one of the plurality of network nodes storing a plurality of file system segments of a file system mapping the plurality of memory records, each one of the plurality of file system segments maps a subset of the plurality of memory records; receiving, by a storage managing module of a first network node of the plurality of network nodes, a request for creating a new one of the plurality of memory records, the request is received from an application executed in the first network node; creating the memory record in the first network node; and registering the memory record in a catalog service via the network.

Optionally, the creating includes assigning a prefix unique to the first network node to an inode number of the memory record.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.

For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

In the drawings:

FIG. 1 is a schematic illustration of a distributed network storage system that includes memory records and is managed by a shared file system, according to some embodiments of the present invention;

FIG. 2A is a schematic illustration of an exemplary file system segment stored by a network node, according to some embodiments of the present invention;

FIG. 2B is a schematic illustration of an exemplary file system with distributed architecture representing metadata and data ownership at a certain time across all network nodes, according to some embodiments of the present invention;

FIG. 2C is a schematic illustration of an exemplary file system segment of the file system of FIG. 2B, stored by a network node, according to some embodiments of the present invention;

FIG. 3 is a flowchart schematically representing a method for accessing a memory record in distributed network storage, according to some embodiments of the present invention;

FIG. 4 is a sequence chart schematically representing an exemplary scenario of accessing a memory record in distributed network storage, according to some embodiments of the present invention; and

FIG. 5 is a sequence chart schematically representing an exemplary scenario of creating a file in distributed network storage, according to some embodiments of the present invention.

DESCRIPTION OF EMBODIMENTS OF THE INVENTION

The present invention, in some embodiments thereof, relates to a shared file system and, more particularly, but not exclusively, to a shared file system with hierarchical host-based storage.

Storage media, typically thought of as non-volatile memory such as a magnetic hard-disk drive (HDD) or a Flash-based solid-state drive (SSD), offer affordable capacity, but at 1,000 to 100,000 times longer latency compared to volatile memory such as dynamic random-access memory (DRAM). Newly developed storage media, such as storage class memory (SCM), which is a form of persistent memory, promise DRAM-like ultra-low latency. When ultra-low latency storage is used, network latency is no longer a relatively insignificant delay as it is in traditional shared storage architectures. New shared storage architectures are required that minimize network access and therefore overall network latency.

According to some embodiments of the present invention, there is provided a hierarchical shared file system and methods of managing the file system by distributing segments of the file system to reduce network latency and augmenting local file management into a distributed storage solution. These embodiments are a hybrid between DAS and shared storage. In this system, metadata and data are predicted to be local, and the rest of the shared file system hierarchy is only searched upon a misprediction.

The system includes multiple memory records that are stored in multiple network nodes. Each network node stores a segment of the file system that maps a subset of the memory records stored in that network node. Each memory record, such as a record represented by an inode in Linux or an entry in the master file table in Windows' new technology file system (NTFS), is a directory or a file in the file system or a file segment such as a range of data blocks. Each memory record is owned (e.g. access managed and/or access controlled) by a single network node in the system, at a given time. The owning network node is the only entity in the system that is allowed to commit changes to its memory records.

When the method of accessing a memory record is applied, according to some embodiments of the present invention, a memory record, requested by an application that is executed in one of the network nodes, is first speculated to be owned, and therefore stored, in a local memory of that network node. When the prediction is correct, only local information is traversed, which results in ultra-low latency access. However, when the speculation fails and the memory record is missing, it is searched for in the other network nodes in the system, as sketched below.
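For illustration only, the following is a minimal Python sketch of this speculate-local access path. All names, addresses and data structures are hypothetical stand-ins, and the catalog dictionary stands in for whichever search mechanism (catalog service, hints or broadcast) the embodiment uses.

```python
# Hypothetical sketch: predict the record is local; search the cluster only
# on a misprediction. Names and addresses are illustrative assumptions.

local_segment = {7: b"contents of inode 7"}    # records owned by this node
catalog = {7: "10.0.0.1", 42: "10.0.0.2"}      # inodeNum -> owning node address

def remote_open(address, inode_num):
    # Placeholder for the direct over-the-network channel to the owner.
    return f"opened inode {inode_num} at {address}"

def access_record(inode_num):
    record = local_segment.get(inode_num)      # local query, no network hop
    if record is not None:
        return record                          # prediction was correct
    owner = catalog[inode_num]                 # misprediction: search for owner
    return remote_open(owner, inode_num)

print(access_record(7))    # served locally at ultra-low latency
print(access_record(42))   # served via the network from its owner
```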

When the file system segment stored in that network node includes cached hints of the last known network node to own the memory record, this last network node is contacted first with a request of the memory record.

Optionally, the search is done by connecting to a catalog service (typically over the network), which replies with the address of the network node that currently owns the desired memory record.

Optionally, the logically centralized catalog service is sharded, so that each subset of the catalog, such as a range of inode numbers or range of hash values of inode numbers, is served by a different network node, for load balancing purposes.

Optionally, the information of memory record ownership is distributed between the network nodes in the system, and the search is done by storing and maintaining hints in the parent directory of each file or by broadcasting the search query to all the network nodes and/or by sequentially querying different nodes and/or group of nodes based on an estimation and/or geographical or contextual proximity.

Finally, a request is sent to the owning network node and access to the memory record is granted. Optionally, the ownership of the memory record is changed, for example due to trends in data consumption and/or failures, and the memory record is transferred to the requesting network node.

Optionally, copies of some or all of the memory records are also stored as a secondary and/or backup in one or more of the other network nodes, that may replace the owning network node, for example in case of failure and/or network traffic load, and create high availability and persistency of the data.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Referring now to the drawings, FIG. 1 is a schematic illustration of a distributed network storage system, such as a cluster of application servers, that includes memory records and is managed by a shared file system, wherein segments of the file system are stored in a common network node with the records they map, according to some embodiments of the present invention.

The system includes multiple network nodes 110 such as application servers, each storing, in a memory 120, a subset of memory records from all the memory records stored in the system. Each one of network nodes 110 includes a file system segment of the file system, mapping the subset of memory records.

A memory record represents an addressable data unit, such as a file, a directory, a layout, or a file segment such as a data block. A memory record may be of any offset and size. The data block does not have to reside on memory storage media.

Each memory record is owned by one network node. Owning a memory record means that the owning network node stores its memory records and is the only entity in the system that is allowed to commit changes to them. Committing changes may include modifying a file by changing the data (e.g. write system calls), cutting it short (e.g. truncate), adding information (e.g. append) and many other portable operating system interface (POSIX) variants. Similar commit operations are required in a directory for rename, touch, remove and other POSIX commands.

Each of network nodes 110 (such as network nodes 111 and 112) may be, for example, a physical computer such as a mainframe computer, a workstation, a conventional personal computer (PC), a server-class computer or multiple connected computers, and/or a virtual server.

Network nodes 110 are connected via one or more network(s) 101. Network 101 may include, for example, a LAN, a high performance computing network such as Infiniband and/or any other network. Optionally, network 101 is comprised of a hierarchy of different networks, such as multiple vLANs or a hybrid of LANs connected by a WAN.

Memory 120 of each network node 110 may include, for example, non-volatile memory (NVM), also known as persistent memory (PM), and/or a solid-state drive (SSD), and/or a magnetic hard-disk drive (HDD), and optionally optical disks or tape drives. These technologies can be internal or external devices or systems, including memory bricks accessible by the network node 110. Memory 120 may also be made of DRAM that is backed up with a supercapacitor and Flash, or other hybrid technologies. In some use cases, such as ephemeral computing (e.g. most cloud services), memory 120 may even be comprised of volatile memory. Memory 120 may also include a partition in a disk drive, or even a file, a logical volume presented by a logical volume manager built on top of one or more disk drives, a persistent media, such as non-volatile memory (NVDIMM or NVRAM), a persistent array of storage blocks, and/or the like. Memory 120 may also be divided into a plurality of volumes, also referred to as partitions or block volumes. When memory 120 includes non-volatile memory, which is considerably faster than other storage media types, the reduced network latency achieved by the method is more significant.

The file system segment stored by each network node is typically implemented using tree structures. One tree supports lookups, adds and removes of inodes, while another is used to do the same for data blocks per file or directory. Some file systems may be implemented using other structures such as hash tables.
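As a non-authoritative illustration, the following Python sketch uses plain dictionaries as stand-ins for the two structures described above, one indexing inodes and one indexing data blocks per inode; a real file system segment would use tree (or hash) structures.

```python
# Hypothetical stand-ins for the two structures described above; dicts are
# used instead of trees purely to keep the sketch short.

inode_tree = {}    # inodeNum -> inode metadata (lookups, adds, removes)
block_trees = {}   # inodeNum -> {offset -> data block} (per file/directory)

def add_inode(inode_num, meta):
    inode_tree[inode_num] = meta
    block_trees[inode_num] = {}

def remove_inode(inode_num):
    inode_tree.pop(inode_num, None)
    block_trees.pop(inode_num, None)

def write_block(inode_num, offset, data):
    block_trees[inode_num][offset] = data

def lookup_block(inode_num, offset):
    return block_trees.get(inode_num, {}).get(offset)

add_inode(7, {"type": "file"})
write_block(7, 0, b"hello")
print(lookup_block(7, 0))   # b'hello'
```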

Reference is now made to FIG. 2A, which is a schematic illustration of an exemplary file system segment stored by a network node 111, according to some embodiments of the present invention. Marked entities are owned by network node 111, while white entities are not owned by network node 111, which may or may not have a hint as to which node owns them. In FIG. 2A, the first, second and seventh inodes (representing files or directories) are owned by network node 111. The third through sixth inodes, as well as the eighth, tenth and beyond, are not, and may not even exist in the cluster.

Note that while the indirect level and the file/directory level are each drawn as one entity, they are typically comprised of many smaller entities, just like the inode level. FIG. 2A implies that the entire information represented by an inode is fully owned by a single node. While that is a convenient implementation, it is also possible to partition a file into layouts, of fixed or flexible sizes, and let different nodes own different layouts.

The information of memory record ownership may be implemented using a different architecture at the second level of the hierarchy, such as a catalog service and partially independent local file systems in the first level of the hierarchy; or using a distributed architecture such as shared (even if hierarchical) file systems. Optionally, the file system segment stored in a network node includes cached hints of the last known network node to own the memory record.

In the catalog architecture, each network node holds and mainly uses a subset of the global metadata, but there is, for example, a centralized catalog service 102 that provides the owning network node per memory record number. Centralized catalog service 102 may be any kind of network node, as described above. Centralized catalog service 102 may be implemented, for example, by using off-the-shelf key-value store services such as Redis, or by implementing a network layer on top of a hash structure, in which the key is the inode number (or some hash applied to it) and the value is the network node identification (ID) (e.g. internet protocol (IP) address). Optionally, the catalog service is sharded or distributed, but logically acts as a centralized one. This too can be implemented using off-the-shelf software such as Cassandra, or by sharding alone. In one embodiment, a subset S out of the N network nodes is also used for holding a subset of the ownership information, and in order to know which shard [0, 1, . . . (S−1)] serves a particular inode number, that number modulo S is calculated, as sketched below.
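The following is a minimal Python sketch of that sharded lookup; the addresses, the value of S and the kv_get callable are illustrative assumptions, with kv_get standing in for an over-the-network GET against a Redis-like key-value shard.

```python
# Hypothetical sketch of shard selection by inode number modulo S.

S = 3
shard_nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # the S catalog shards

def shard_for(inode_num):
    # Shard index is inodeNum modulo S, as described above.
    return shard_nodes[inode_num % S]

def catalog_lookup(inode_num, kv_get):
    # kv_get(shard_address, key) stands in for a network GET against a
    # key-value store holding <inodeNum, NodeID> pairs.
    return kv_get(shard_for(inode_num), inode_num)

# Example: inode 17 is served by shard 17 % 3 == 2.
print(shard_for(17))   # 10.0.0.3
```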

In the distributed architecture there is no catalog service. Instead, all network nodes use the same tree root of the file system, but each holds a different subset of that tree.

Reference is now made to FIG. 2B, which is a schematic illustration of an exemplary file system with distributed architecture representing metadata and data ownership at a certain time across all network nodes 110, according to some embodiments of the present invention. Reference is also made to FIG. 2C, which is a schematic illustration of an exemplary file system segment of the file system of FIG. 2B, stored by a network node 111, according to some embodiments of the present invention. Lightly marked entities are owned by network node 111. The ownership of white entities is unknown to network node 111, while the darker entities are speculated cached hints, kept for performance optimization purposes. Hints are often located at elements higher in the hierarchy than, or sibling to, the owned elements, and not at the owned elements themselves. Also, hints may be around data which is mirrored in the local network node but owned by another network node.

Optionally, for both hierarchical architectures, copies of some or all of the memory records are stored in one or more of the network nodes, for example in network node 111, in addition to the owning network node. In this case, a secondary and/or backup network node may replace the owning network node, for example in case of failure and/or network traffic load, and create high availability and persistency of the data. This could also be leveraged for load balancing for read-only access requests, when no writing is needed. A particular file, such as a golden image or template may be mirrored multiple times or even to all network nodes, in order to reduce network traffic and/or increase local deduplication ratio.

Optionally, ownership of entire clones and snapshots, or even sets of snapshots (e.g. all snapshots older than snapshot i), can be re-evaluated as a whole, and outweigh the per-file or per-layout ownership process. For example, in order to back up and reduce cost, migration of all data older than or belonging to a daily snapshot to a remote and cheaper site (e.g. cloud storage) is possible. For example, when a file is snapshotted every night, it is possible to have all versions, probably in an efficient deduplicated format, reside on the relevant network node. However, at some point in time, for instance when memory 120 crosses its lowest-tier watermark, the system may decide to move cold data and all snapshots that are more than a week old to a third-party storage system, such as a network file system (NFS) server or cloud storage (e.g. AWS S3). In such scenarios the catalog service can point to a Hypertext Transfer Protocol (HTTP) address or an NFS server and path. A shared file system architecture could replace inodes with similar pointers.

Reference is now made to FIG. 3, which is a flowchart schematically representing a method for accessing a memory record in distributed network storage, according to some embodiments of the present invention.

First, as shown in 301, the memory records are stored in network nodes 110 (such as network nodes 111 and 112), to create the file system. The file system may include a root, a catalog service, tables of nodes, etc. Data and metadata are also created.

Then, as shown in 302, a request for accessing one of the memory records is received by a storage managing module 131 of network node 111 from an application 141 executed in network node 111, for example to read or write B bytes at offset O in a file.

The application may be any software component that accesses the storage, for example for read and/or write operations, either directly or via libraries, middleware and/or an overlay file system manager.

Then, as shown in 303, the file system segment stored in network node 111 is queried for the requested memory record. This may be done by storage managing module 131 in any known way of memory access.

When the memory record is found locally, as shown in 304, it is accessed and the data is provided to the application. In this case, no network latency is experienced, as no access to network 101 is required.

However, as shown in 305, when the memory record is missing from the records subset stored in network node 111, an address of a network node, such as network node 112, owning the memory record is queried for by storage managing module 131.

In this case, there is an associated added latency for falsely speculating that the relevant memory record is local. Nevertheless, the latency added by the local search may not be significant when NVM media is used, making it negligible compared to accessing over-the-network servers.

The address may be, for example, an IP address of network node 112, or an identification number of network node 112 such as a ‘NodeID’, which can be used to calculate or look up the IP address, a media access control (MAC) address, an HTTP address or any other network address.

Optionally, when the file system segment stored in that network node includes cached hints of the last known network node to own the memory record, network node 111 connects to this last network node with a request of the memory record, before searching other network nodes. This last known network node may be network node 112 still owning and storing the memory record, or may provide the address of network node 112 where the memory record is stored, or may fail and respond that the hint is no longer correct.

Optionally, in catalog architecture, the search is done by connecting to a catalog service, such as a centralized catalog service 102, typically over network 101, which replies with the address of network node 112.

Optionally, in a distributed architecture, the search is done by broadcasting the search query to all the network nodes and/or by sequentially querying different nodes and/or groups of nodes based on an estimation and/or geographical or contextual proximity, as sketched below. For example, when the cluster of network nodes is spread over multiple data centers, missing memory records are first searched for in the same data center because of the superior local network resources and the higher probability of data sharing. In an opposite example, certain data is expected to be found in another geography at certain hours, for example when two sites are in different time zones and the output data of the first team is used by the second team as their input data.
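A minimal Python sketch of this proximity-ordered search follows; the node names, grouping and the ask callable are illustrative assumptions, with ask standing in for an over-the-network ownership query.

```python
# Hypothetical sketch: try the cached hint first, then nodes grouped by
# proximity (e.g. same data center), before giving up.

def find_owner(inode_num, hint, groups_by_proximity, ask):
    """ask(node, inode_num) returns the owner's address, or None."""
    if hint is not None:
        result = ask(hint, inode_num)        # last known owner, if cached
        if result is not None:
            return result
    for group in groups_by_proximity:        # nearest group first
        for node in group:
            result = ask(node, inode_num)
            if result is not None:
                return result
    return None                              # record not found in the cluster

# Usage: the local data center is queried before the remote one.
groups = [["dc1-node2", "dc1-node3"], ["dc2-node1", "dc2-node2"]]
owner = find_owner(42, "dc1-node2", groups,
                   ask=lambda node, ino: node if node.startswith("dc2") else None)
print(owner)   # dc2-node1
```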

Optionally, when a memory record representing a file or a directory is requested, the memory record representing the parent directory may be queried before querying for the requested memory record. This may be the direct parent directory, and also other directories up in the hierarchy. The memory record representing the parent directory may be locally owned by network node 111, may be owned by network node 112 or may be owned by a different network node. This process may be repeated for any directory structure. When the memory record representing the parent directory is found, it is read and may be locally cached and saved as a future hint.

Finally, as shown in 306, a direct communication channel is established between network node 111 and network node 112 for example via network 101, according to the address. Access to the memory record is then provided to network node 111.

Then, optionally, the information is not read and locally cached; instead, network node 112 makes a decision to either change the ownership of the layout, file or directory and start a migration process of the memory record, and potentially its surrounding information, to network node 111, or to perform a remote input/output (IO) access protocol with network node 111. The IO protocol may include any over-the-network file-access interfaces/libraries and object store application programming interface (API).
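The following Python sketch illustrates one possible owner-side policy for that decision; the request-count threshold and all names are assumptions, since the document leaves the migration heuristic open (for example, trends in data consumption).

```python
# Hypothetical owner-side policy: migrate ownership to a hot requester,
# otherwise keep ownership and serve the request via remote IO.

MIGRATION_THRESHOLD = 10   # assumed policy knob, not from the source

def handle_access(record, requester, recent_requests_by):
    """recent_requests_by(node) returns how often node asked for this record."""
    if recent_requests_by(requester) > MIGRATION_THRESHOLD:
        return ("migrate", record)      # transfer the record and its ownership
    return ("remote_io", record)        # keep ownership, serve over the network

action, _ = handle_access({"inode": 42}, "node_111", lambda node: 12)
print(action)   # migrate
```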

Optionally, network node 112 may block access to the memory record, temporarily or permanently. A temporary block, for example, may occur when the memory record is currently locked or accessed by an application 142 or another one of network nodes 110, so network node 112 temporarily blocks read and/or write access for network node 111 (e.g. via a RETRY response). Another type of blocked response may occur if the requested data is corrupted with no means to reconstruct it.

Optionally, when network node 112 is unavailable, network node 111 queries, by storage managing module 131, for an address of another network node having a copy of the memory record. In the catalog architecture, storage managing module 131 may query via centralized catalog service 102. Also, a copy of the memory record may be located in network node 111 itself, saving the need to traverse network 101.

Optionally, when the system includes copies of the memory record in other network node(s), and provided that these are not treated only as hints that will be validated later on, network node 112 informs them of any changes made to the memory record, or at least invalidates it, so that outdated copies may be removed or updated. In the catalog architecture, network node 112 may send a notification to centralized catalog service 102 regarding the changes. In a distributed architecture, network node 112 records a small number of network nodes to be informed, and uses broadcast when that number crosses a threshold.
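A short Python sketch of that invalidation scheme follows; the threshold value, class and callables are hypothetical, with send and broadcast standing in for the network primitives.

```python
# Hypothetical sketch: the owner tracks copy holders and sends targeted
# invalidations, falling back to broadcast past a threshold.

BROADCAST_THRESHOLD = 8   # assumed cutoff, not from the source

class OwnedRecord:
    def __init__(self, inode_num):
        self.inode_num = inode_num
        self.copy_holders = set()    # nodes known to hold copies

    def register_copy(self, node):
        self.copy_holders.add(node)

    def on_change(self, send, broadcast):
        msg = ("invalidate", self.inode_num)
        if len(self.copy_holders) > BROADCAST_THRESHOLD:
            broadcast(msg)                     # many holders: one broadcast
        else:
            for node in self.copy_holders:     # few holders: targeted notices
                send(node, msg)

rec = OwnedRecord(42)
rec.register_copy("node_111")
rec.on_change(send=lambda node, m: print(node, m), broadcast=print)
```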

Reference is now made to FIG. 4, which is a sequence chart schematically representing an exemplary scenario of accessing a memory record in distributed network storage, according to some embodiments of the present invention. The exemplary scenario is demonstrated using the POSIX semantics and Linux implementation (e.g. virtual file system (VFS) and its dentry cache). The exemplary scenario shows different types of local ownership, remote ownership and caching, which are underlined.

In this example, a network node ‘Node_c’ 421 includes an application (App), a VFS, a front end (FE) and a local file system (FS). The front end is optional and/or implementation dependent, as some flexibility exists in the way the front end is implemented; for example, unlike in this example, the front end may just be an escape option in the local file system. The FE and FS may both be represented as storage managing module 130 in FIG. 1. Other ways to partition storage managing module 130 also exist, some of which are shown below as optional.

The application requests to open file c that is located in directory b that is in directory a (401) that is under the root. The open function call may include relevant flag argument(s) that can later be passed to NodeID.

The VFS looks for directory a in root, which turns out to be locally cached, and then looks for directory b in directory a, which is not locally cached (402). The lookup request is then transferred to the FE (403) and then to the local FS. The local FS predicts that directory b is locally owned and stored, looks for directory b, finds that it is indeed locally owned and returns it to the FE (404). The FE then returns directory b to the VFS (405), which caches it (406), looks up file c in directory b, finds the inode number (inodeNum) of file c, but the inode itself is not found in the VFS inode cache (406). The open inodeNum c request is then transferred to the FE (406) and to the local FS (407). The local FS predicts that file c is locally owned and stored, tries to open file c but finds that this is a misprediction and returns, because inodeNum c is not locally owned (408). The FE then connects over the network to the catalog service, ‘Node_cs’ 422, requesting the value that matches the inodeNum key (409). The catalog service is a key-value store, so it searches and returns the value that matches the inodeNum key. The value is NodeID, i.e. the identity of the network node that owns file c (410). The FE receives the NodeID, checks for its validity, calculates the node address and establishes a direct connection (P2P handshake) to the owning network node based on the NodeID (411). When the ownership is resolved between the network nodes, the open request is complete and returns to the VFS (412), which caches it in its inode cache and returns the file descriptor to the application (413).
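For illustration, the following Python function condenses this flow into a single path-resolution loop; every name is a hypothetical stand-in, and the sketch simplifies FIG. 4 by eliding the case where the directory-entry lookup itself needs remote resolution.

```python
# Hypothetical condensation of the FIG. 4 flow: resolve each name to an
# inodeNum, speculate the inode is locally owned, and on a misprediction
# resolve the owner through the catalog (inodeNum -> NodeID).

def open_path(components, lookup_name, local_open, catalog_get, remote_open,
              inode_cache):
    inode_num = 2                                   # assumed root inodeNum
    for name in components:                         # e.g. ["a", "b", "c"]
        inode_num = lookup_name(inode_num, name)    # dir entry -> inodeNum
        if inode_num in inode_cache:                # VFS inode cache hit
            continue
        inode = local_open(inode_num)               # speculate locally owned
        if inode is None:                           # misprediction (step 408)
            node_id = catalog_get(inode_num)        # Node_cs: inodeNum -> NodeID
            inode = remote_open(node_id, inode_num) # P2P connection (step 411)
        inode_cache[inode_num] = inode              # cache (steps 406/412)
    return inode_num

# Toy wiring: directory a is cached, directory b is local, file c is remote.
names = {(2, "a"): 3, (3, "b"): 4, (4, "c"): 5}
cache = {3: "inode a"}
result = open_path(["a", "b", "c"],
                   lookup_name=lambda d, n: names[(d, n)],
                   local_open=lambda i: "local inode" if i == 4 else None,
                   catalog_get=lambda i: "Node_x",
                   remote_open=lambda node, i: f"inode {i} via {node}",
                   inode_cache=cache)
print(result)   # 5, the inodeNum of file c
```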

Optionally, when a memory record has to be created, for example as requested by application 141, it is created locally in memory 121 of network node 111. Storage managing module 131 assigns a new inode number to the new memory record. The inode number may include a prefix unique to network node 111. Storage managing module 131 then updates the directory containing the new memory record, and when the directory is owned by another network node 112, contacts network node 112 to update it. In the catalog architecture, network node 111 registers the new memory record in centralized catalog service 102 by connecting to centralized catalog service 102 and providing the new inode number. Similarly, deletion of a memory record is done by the owning network node, and centralized catalog service 102 is updated.

Reference is now made to FIG. 5, which is a sequence chart schematically representing an exemplary scenario of creating a file in distributed network storage, according to some embodiments of the present invention.

When the application requests to open a new file c in directory b that is in directory a (501) that is under the root, the system looks for the directories and the file as described above (402-406). However, when file c is not mentioned in directory b and the application requested the open system call using an O_CREAT argument (back in step 501), then the FE is requested to create a new file, a request that continues to the local FS (507). The local FS creates file c locally and assigns a new inodeNum to it (508), typically using a local prefix node number. The local FS updates directory b and returns to the FE (508). The FE then sends the new inodeNum with the NodeID of Node_c to update the catalog Node_cs (509), which registers it by adding the Key-Value pair <inodeNum c, Node_c> (510) and returning an acknowledgement to the FE (511), which returns the inode to the VFS, which caches it in the inode cache (412) and returns the file descriptor to the application (413).
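A brief Python sketch of this create path follows; the prefix width, counter and catalog dictionary are illustrative assumptions.

```python
# Hypothetical sketch: new inode numbers carry a prefix unique to the
# creating node, and the <inodeNum, NodeID> pair is registered in the catalog.

import itertools

NODE_ID = "Node_c"
NODE_PREFIX = 0x0C              # assumed unique per-node prefix
_counter = itertools.count(1)   # local monotonic counter

def new_inode_num():
    # Upper bits identify the creating node, so inode numbers never
    # collide across nodes even without coordination.
    return (NODE_PREFIX << 48) | next(_counter)

def create_file(directory, name, catalog):
    inode_num = new_inode_num()     # created and owned locally (step 508)
    directory[name] = inode_num     # update the containing directory b
    catalog[inode_num] = NODE_ID    # register <inodeNum, Node_c> (509-510)
    return inode_num

catalog, dir_b = {}, {}
fd = create_file(dir_b, "c", catalog)
print(hex(fd), catalog[fd])
```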

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

It is expected that during the life of a patent maturing from this application many relevant shared file systems will be developed and the scope of the term shared file system is intended to include all such new technologies a priori.

The terms “comprises”, “comprising”, “includes”, “including”, “having” and their conjugates mean “including but not limited to”. This term encompasses the terms “consisting of” and “consisting essentially of”.

As used herein, the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise. For example, the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.

The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.

The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment of the invention may include a plurality of “optional” features unless such features conflict.

Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.

All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

1. A method of accessing a memory record in distributed network storage, comprising:

storing a plurality of memory records in a plurality of network nodes, each one of said plurality of network nodes storing a plurality of file system segments of a file system mapping said plurality of memory records, each one of said plurality of file system segments maps a subset of said plurality of memory records;
receiving, by a storage managing module of a first network node of said plurality of network nodes, a request for accessing one of said plurality of memory records, said request is received from an application executed in said first network node;
querying a first file system segment stored in said first network node for said memory record;
when said memory record is missing from said first memory records subset, querying for an address of a second network node of said plurality of network nodes, wherein said memory record is stored in a second memory records subset of said second network node; and
providing said first network node with an access to said memory record at said second network node via a network according to said address.

2. The method of claim 1, wherein said providing comprises establishing a direct communication channel between said first network node and said second network node via said network according to said address to provide said access.

3. The method of claim 1, wherein said querying for said address includes:

sending a request to a catalog service via said network; and
receiving a reply message from said catalog service, said reply message including said address.

4. The method of claim 1, wherein said querying for said address includes sending a request to each of said plurality of network nodes to receive said address.

5. The method of claim 1, wherein said querying for said address includes querying for a last known location of said memory record cached in said first file system segment.

6. The method of claim 1, wherein said second network node temporarily blocks write access to said memory record for said first network node when said memory record is currently accessed by any other of said plurality of network nodes.

7. The method of claim 6, wherein said second network node temporarily blocks access to said memory record for said first network node when said memory record is currently written by any other of said plurality of network nodes.

8. The method of claim 1, wherein a copy of said memory record is also stored in a third of said plurality of network nodes.

9. The method of claim 8, further comprising:

when said second network node is unavailable, querying for an address of said third network node; and
establishing a direct communication channel between said first network node and said third network node via said network according to said address to provide access to said memory record.

10. The method of claim 1, wherein a copy of said memory record is also stored in said first network node and may be accessed instead of accessing the memory record at said second network node via said network.

11. The method of claim 1, further comprising, before said querying:

querying for an address of a directory containing said memory record; and
querying for an address of said memory record in said directory.

12. The method of claim 1, wherein said memory record includes multiple file segments.

13. The method of claim 1, wherein said querying for said address includes providing an inode number of said memory record.

14. The method of claim 1, wherein said querying for said address includes providing a layout number of said memory record.

15. A computer readable medium comprising computer executable instructions adapted to perform the method of claim 1.

16. A system of managing a distributed network storage, comprising:

a file system segment stored in a first of a plurality of network nodes, said file system segment is one of a plurality of file system segments of a file system mapping a plurality of memory records;
a program store storing a storage managing code; and
a processor, coupled to said program store, for implementing said storage managing code, the storage managing code comprising: code to receive an access request to a memory record of said plurality of memory records from an application executed in said first network node; code to query said file system segment for said memory record in said first memory records subset; code to query for an address of a second network node of said plurality of network nodes when said memory record is missing from said first memory records subset, wherein said memory record is stored in a second memory records subset of said second network node; and code to provide said first network node with an access to said memory record at said second network node via a network according to said address.

17. A distributed network storage system, comprising:

a plurality of network nodes connected via a network, each including a storage managing module;
a plurality of file system segments of a file system, each stored in one of said plurality of network nodes;
a plurality of memory records managed by said plurality of file system segments, wherein each of said plurality of memory records is owned by one of said plurality of network nodes and stored in at least one of said plurality of network nodes; and
wherein when an application executed in a first of said plurality of network nodes requests an access to one of said plurality of memory records, and said memory record is missing from a memory records subset stored in said first network node, a storage managing module included in said first network node queries for an address of a second network node of said plurality of network nodes, wherein said memory record is stored in a second memory records subset of said second network node, and provides said first network node with an access to said memory record at said second network node via a network according to said address.

18. A method of creating a memory record in distributed network storage, comprising:

storing a plurality of memory records in a plurality of network nodes, each one of said plurality of network nodes storing a plurality of file system segments of a file system mapping said plurality of memory records, each one of said plurality of file system segments maps a subset of said plurality of memory records;
receiving, by a storage managing module of a first network node of said plurality of network nodes, a request for creating a new one of said plurality of memory records, said request is received from an application executed in said first network node;
creating said memory record in said first network node; and
registering said memory record in a catalog service via said network.

19. The method of claim 18, wherein said creating includes assigning a prefix unique to said first network node to an inode number of said memory record.

Patent History
Publication number: 20150248443
Type: Application
Filed: Mar 2, 2015
Publication Date: Sep 3, 2015
Inventor: Amit GOLANDER (Tel-Aviv)
Application Number: 14/635,261
Classifications
International Classification: G06F 17/30 (20060101); H04L 12/911 (20060101); H04L 29/08 (20060101);