Background¶
To understand which problem mmnpz solves, we will look at a use case. If you are already familiar with memory maps, you can skip ahead to the implementation section.
Use case¶
Assume you are training a machine learning model on short excerpts of a large collection of audio files. The audio files are preprocessed into spectrograms, which are 2-dimensional matrices, with time in the first dimension. On each iteration over the dataset, you want to process one temporal excerpt of each spectrogram.
Naive way¶
Naively, you could save the spectrogram of each audio file as a .npy file. To produce excerpts for a training iteration, you load the .npy files in random order and pick a random excerpt from each:
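import numpy as np

def get_excerpts(filenames: list[str], excerpt_length: int, rng: np.random.Generator):
    for idx in rng.permutation(len(filenames)):
        # Reads the complete file from disk into memory.
        x = np.load(filenames[idx])
        # The upper bound is exclusive, so add 1 to allow the last possible start.
        start = rng.integers(len(x) - excerpt_length + 1)
        yield x[start:start + excerpt_length]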
This is inefficient: numpy.load() reads the full .npy file from disk into
memory, only to return a short excerpt.
Memory maps¶
To improve on the previous recipe, instead of reading the .npy file into
memory, you can create a read-only memory map of the .npy file by passing
mmap_mode="r":
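def get_excerpts(filenames: list[str], excerpt_length: int, rng: np.random.Generator):
    for idx in rng.permutation(len(filenames)):
        # Only establishes a read-only memory map; no array data is read yet.
        x = np.load(filenames[idx], mmap_mode="r")
        # The upper bound is exclusive, so add 1 to allow the last possible start.
        start = rng.integers(len(x) - excerpt_length + 1)
        yield x[start:start + excerpt_length]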
A memory map is a construct provided by the operating system that ties a range
of your process’s memory addresses to a file on disk. Data is transferred from
disk to physical memory only when your process reads one of these addresses.
In this case, the numpy.load() call only establishes the memory map, and
x[start:start + excerpt_length] creates a view into a slice of that map. The
required part of the .npy file is loaded from disk only when the caller
performs computations with the array contents. The operating system is free to
cache parts of previously accessed .npy files if enough main memory is
available.
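The lazy behaviour can be observed directly. The following is a minimal sketch, not part of mmnpz; the temporary file name and the array size are made up for illustration. It saves an array, maps it, and checks that slicing yields a read-only view of the map rather than a copy:

```python
import os
import tempfile

import numpy as np

# Write a dummy "spectrogram" to a temporary .npy file (illustrative data).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "spectrogram.npy")
np.save(path, np.arange(1000 * 80, dtype=np.float32).reshape(1000, 80))

# Establish the memory map; only the .npy header has been read so far.
x = np.load(path, mmap_mode="r")

# Slicing creates a view into the map, still without reading array data.
excerpt = x[100:200]

# Computing with the contents makes the OS page in the touched part.
total = float(excerpt.sum())

print(type(x).__name__, excerpt.shape, np.shares_memory(x, excerpt))
```

Because the excerpt shares memory with the map, copying it (for example with np.array(excerpt)) is what finally detaches it from the file.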
Memory maps are an elegant way of loading data from disk lazily. All the logic of when to load which part of the data and what to keep in the cache is offloaded to the operating system. However, the above recipe is still somewhat inefficient: for every excerpt it returns, it goes through the file system again to open the file and set up a new memory map.
Precreated memory maps¶
To avoid the file system overhead, you may create all memory maps in advance:
maps = [np.load(fn, mmap_mode="r") for fn in filenames]
And then access the maps in the excerpt generator:
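def get_excerpts(maps: list[np.ndarray], excerpt_length: int, rng: np.random.Generator):
    # Iterate over the precreated maps (not over the file names).
    for idx in rng.permutation(len(maps)):
        x = maps[idx]
        # The upper bound is exclusive, so add 1 to allow the last possible start.
        start = rng.integers(len(x) - excerpt_length + 1)
        yield x[start:start + excerpt_length]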
However, each memory map requires holding an open file descriptor, and for performance reasons, operating systems usually place restrictions on how many open file descriptors each process can hold. Thus, this recipe does not scale.
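To see how close a dataset would come to this restriction, you can query the limit from Python. This is a sketch for Unix-like systems only (the resource module is not available on Windows); the helper fits_in_fd_limit and its reserve parameter are hypothetical names introduced here for illustration:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")


def fits_in_fd_limit(num_files: int, reserve: int = 16) -> bool:
    """Check whether num_files open maps would stay below the soft limit.

    `reserve` leaves room for descriptors the process needs anyway
    (standard streams, sockets, ...); its default is an arbitrary guess.
    """
    soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return num_files + reserve <= soft_limit
```

Raising the limit is sometimes possible, but it only postpones the problem for growing datasets.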
Single memory map¶
If you could concatenate all your .npy files into a single file, along with some index that allows you to find each one, you could open a single memory map and then create slices of the memory map to represent each item. This is what mmnpz provides. You can create a .npz file in advance:
with mmnpz.NpzWriter("dataset.npz") as f:
    for fn in filenames:
        f.write(fn, np.load(fn))
Load it once:
data = mmnpz.load("dataset.npz")
And then use it in the excerpt generator:
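def get_excerpts(data: mmnpz.NpzReader, excerpt_length: int, rng: np.random.Generator):
    for idx in rng.permutation(len(data)):
        # Returns a view into the single memory map of the .npz file.
        x = data[data.files[idx]]
        # The upper bound is exclusive, so add 1 to allow the last possible start.
        start = rng.integers(len(x) - excerpt_length + 1)
        yield x[start:start + excerpt_length]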
This recipe retains the advantages of memory maps, but avoids the file system overhead.
Implementation¶
mmnpz uses the .npz format as the container for multiple numpy arrays. A .npz
file is a ZIP file of .npy files. If the .npy files are stored uncompressed (as
when written with numpy.savez() or mmnpz.NpzWriter), their data is included
byte for byte in the .npz file. The ZIP format also includes a global index
(the central directory) specifying the name and location of each .npy file
within.
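You can verify this structure with the standard library alone. The following sketch (file and array names are made up) writes a small .npz file with numpy.savez() and lists what the ZIP index says about its members:

```python
import os
import tempfile
import zipfile

import numpy as np

# Write a small archive with two arrays (illustrative names and data).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.npz")
np.savez(path, first=np.zeros((3, 4)), second=np.arange(5))

with zipfile.ZipFile(path) as zf:
    infos = zf.infolist()
    for info in infos:
        stored = info.compress_type == zipfile.ZIP_STORED
        # header_offset is where the member's local ZIP header starts.
        print(info.filename, info.header_offset, info.file_size, stored)

names = sorted(info.filename for info in infos)
all_stored = all(info.compress_type == zipfile.ZIP_STORED for info in infos)
```

Members written by numpy.savez_compressed() would report a different compress_type and cannot be mapped directly.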
When you instantiate a mmnpz.NpzReader, it creates a memory map of
the full .npz file and reads its global index to find the names and offsets of
all uncompressed .npy members. When you access a member by name for the first
time, it looks at the associated offset and reads the local ZIP header to find
the starting position of the .npy file. Finally, it parses the header of the
.npy file to find the shape, dtype, memory layout and offset of the actual
array data and creates a corresponding view of the full memory map to return.
By default, this view is cached to speed up future queries. All parsing of ZIP
and .npy headers uses the memory map rather than file descriptors, making the
implementation safe for multithreading and multiprocessing.
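The lookup steps described above can be sketched with the standard library and numpy. This is an illustration under simplifying assumptions, not mmnpz's actual code: the function name view_member is hypothetical, only plain (non-object) dtypes are handled, and archives large enough to need ZIP64 offsets are not considered.

```python
import ast
import mmap
import os
import struct
import tempfile
import zipfile

import numpy as np


def view_member(mm: mmap.mmap, info: zipfile.ZipInfo) -> np.ndarray:
    """Return an array view of one uncompressed .npy member of a mapped .npz."""
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("compressed members cannot be memory-mapped")
    # Local ZIP header: 30 fixed bytes, then the file name and extra field,
    # whose lengths are little-endian uint16 values at byte offsets 26 and 28.
    name_len, extra_len = struct.unpack_from("<HH", mm, info.header_offset + 26)
    npy = info.header_offset + 30 + name_len + extra_len
    # .npy header: magic string, format version, header length, dict literal.
    if mm[npy:npy + 6] != b"\x93NUMPY":
        raise ValueError("member is not a .npy file")
    major = mm[npy + 6]
    if major == 1:
        (header_len,) = struct.unpack_from("<H", mm, npy + 8)
        header_start = npy + 10
    else:
        (header_len,) = struct.unpack_from("<I", mm, npy + 8)
        header_start = npy + 12
    encoding = "latin1" if major < 3 else "utf8"
    raw = mm[header_start:header_start + header_len]
    header = ast.literal_eval(raw.decode(encoding))
    # Simplification: np.dtype() covers plain dtypes; object arrays are pickled
    # inside .npy files and cannot be viewed in place.
    dtype = np.dtype(header["descr"])
    if dtype.hasobject:
        raise ValueError("object arrays cannot be memory-mapped")
    order = "F" if header["fortran_order"] else "C"
    # The array data starts right after the .npy header.
    return np.ndarray(header["shape"], dtype=dtype, buffer=mm,
                      offset=header_start + header_len, order=order)


# Usage with a small archive (illustrative names and data).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo.npz")
expected = np.arange(12, dtype=np.float32).reshape(3, 4)
np.savez(path, spec=expected)

with open(path, "rb") as f:
    # One read-only map of the whole file; it stays valid after f is closed.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
with zipfile.ZipFile(path) as zf:
    members = {info.filename: info for info in zf.infolist()}

spec = view_member(mm, members["spec.npy"])
print(spec.shape, spec.dtype, np.array_equal(spec, expected))
```

Note that the sketch still uses zipfile to read the global index, whereas the text above describes mmnpz as parsing all headers from the memory map itself.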
Caveats¶
Alignment: ZIP files sequentially store the local header and data of each
member, followed by a global index at the end that usually repeats the local
headers. The offset at which each numpy array starts thus depends on the sizes
of all local headers and members that came before, and will often not be
aligned to the word size of the array data. It would be possible to fix this by
adding alignment bytes to the local header extra data, with the small downside
that zipfile would unnecessarily copy the alignment bytes over to the global
index, as it does not distinguish local and global extra data.
I/O overhead: To create a view for a member requested by name,
mmnpz needs to read its .npy header, which requires a local read inside the
.npz file. This incurs some overhead, which is why by default, this step is
delayed until a member is actually requested (at which time the member’s nearby
data will probably need to be read anyway). It would be possible to fix this
by including the .npy header in the global index extra data, with the downside
that zipfile would unnecessarily copy this over to the local header,
as it does not distinguish local and global extra data.
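To illustrate the alignment caveat: the sketch below places a float64 view at an odd byte offset inside a buffer, as can happen inside a .npz file, and computes how many padding bytes would be needed to realign it. The helper padding_for is a hypothetical name used only here, not part of mmnpz.

```python
import numpy as np


def padding_for(offset: int, alignment: int) -> int:
    """Bytes to add so that offset + padding is a multiple of alignment."""
    return (-offset) % alignment


# A buffer allocated by numpy, and a view that starts 3 bytes into it.
base = np.zeros(5, dtype=np.float64)
shifted = np.ndarray((4,), dtype=np.float64, buffer=base, offset=3)

# numpy still works with such arrays, but reports them as unaligned.
address_shift = shifted.ctypes.data - base.ctypes.data
print(address_shift, shifted.flags.aligned, padding_for(3, shifted.itemsize))
```

Computations on unaligned arrays remain correct; numpy may just have to take slower code paths or make temporary copies.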
Alternatives¶
mmnpz is far from the first implementation of this recipe, but it is one of the simplest. It does not invent a custom format as mmap.ninja does, but uses existing standards. It does not attempt to compete with more powerful alternatives such as Hugging Face Datasets, which is based on Apache Arrow and has extra features such as splitting large datasets into multiple disk files or streaming over HTTP. If you already use an alternative, there is no point in switching to mmnpz. If, however, you just need to move your pipeline to something more efficient, mmnpz may be easier to learn, set up and try than the others.