The web version only has simple instructions since chapter 04, while the full book has detailed explanations and background info.

0104: fsync

Ensuring writes reach disk

Since data is stored on disk, we must ensure it is actually written to disk. If we only write to a file, a power loss can cause the file to disappear or be filled with 0x00. These are problems a database must solve.

A database guarantees that written data is not lost. This is usually called durability. The guarantee is defined by a successful return to the caller. If the database crashes before returning, the state is uncertain to the caller. But if the caller receives success, it can trust the write not disappearing.

Cache and fsync

Each file write does not directly map to a disk write. The OS has a memory cache. Writes go to the cache first, then are synced to disk later. This allows merging repeated writes and improves throughput. Repeated reads also benefit.

This cache is called the page cache. The page here matches the CPU virtual memory page. A page is the smallest unit of IO, with a fixed size (usually 4K or 16K). You may ask how disk IO cache relates to virtual memory. Look into mmap. Also note that database docs often call B-tree nodes “pages”. Do not confuse them.

Besides the page cache, disks may also have their own RAM cache. To ensure data reaches disk, an operation must flush all cache layers and wait for completion. This is the fsync syscall on Linux. In Go, call it via Sync() on *os.File.

func (log *Log) Write(ent *Entry) error {
    if _, err := log.fp.Write(ent.Encode()); err != nil {
        return err
    }
    return log.fp.Sync() // fsync
}

Windows has a corresponding operation, so Sync() also applies.

Some app-level file APIs also have caches that need flushing, such as fflush() in C. Go file IO maps directly to OS APIs and has no extra cache.

fsync the parent directory

On Linux, fsync ensures file data is written, but does not ensure the file itself exists. This is because a file is recorded by its parent directory. If a directory entry is added (file creation) but not written to disk before power loss, the file cannot be reached, even if its data is on disk. To fix this, call fsync on the directory.

That is: creating files, renaming files, and deleting files all require fsync on the containing directory.

This is Unix-specific. Windows does not need this. The Go standard library has no method for fsyncing a directory, so you must invoke syscalls directly:

func createFileSync(file string) (*os.File, error) {
    fp, err := os.OpenFile(file, os.O_RDWR|os.O_CREATE, 0o644)
    if err != nil {
        return nil, err
    }
    if err = syncDir(path.Base(file)); err != nil {
        _ = fp.Close()
        return nil, err
    }
    return fp, err
}

func syncDir(file string) error {
    flags := os.O_RDONLY | syscall.O_DIRECTORY
    dirfd, err := syscall.Open(path.Dir(file), flags, 0o644)
    if err != nil {
        return err
    }
    defer syscall.Close(dirfd)
    return syscall.Fsync(dirfd)
}

Replace os.OpenFile() with this function:

func (log *Log) Open() (err error) {
    log.fp, err = createFileSync(log.FileName)
    return err
}

Unix basics

open opens or creates a file and returns a number on success. This number identifies the file and is used for later operations, such as fsync. This number is called a file descriptor (fd).

A file descriptor is not limited to files. It can refer to other OS resources, such as network sockets or directories. On Unix, open can open directories, though the returned fd cannot be read or written like a file. All fds must be closed.

You can read the os.File source in the standard library. It mainly wraps various syscalls.

Summary

Durability is achieved via fsync. However, this is still not enough for power loss cases: a log record may be only half written before a crash (atomicity). We’ll handle this next.

CodeCrafters.io has similar courses in many programming languages, including build your own Redis, SQLite, Docker, etc. It’s worth checking out.