Rethinking Vis' Text Management Data Structure

When initiating vis the goal was to build a simple, elegant and universal editor which copes with arbitrary files, including large, binary or single-line ones and features unlimited history for undo/redo support.

The underlying piece chain text management data structure was initially chosen because it is simple, independent of file size, reasonably performant (assuming few and localized changes) and not in widespread use.

It is basically a double linked list of modifications with a caching mechanism to merge contiguous changes. The first node, representing the initial file content, can be backed using mmap(2) making optimal use of the operating system’s virtual memory system, resulting in constant load times. All subsequent editing operations scale linearly in the number of nodes i.e. the number of non contiguous modifications.

With the introduction of multiple selections and structural regular expressions, the assumptions of few and localized changes no longer hold. Resulting in a fragmented linked list and degraded performance due the fact that the core data structure is no longer the best fit for the current functionality.

This can be demonstrated with a concrete example: the global search and replacement command x/pattern/c/replacement/ has a complexity linear in the number of matches O(selections). Performing the same changes interactively by first selecting the desired text x/pattern/ followed by the change normal mode command and subsequent typing in insert mode, results in O(selections*keypress) insertions. Hence, the former is substantially faster.

Some desirable properties for a redesign are:

  • Better performance i.e. sub-linear time complexity

  • Space efficient, using structural sharing, supporting copy-on-write

  • Intrusive, avoiding superfluous pointer chasing, good cache locality, regular memory access patterns

  • Augmentable with arbitrary additional data

  • Fully persistent i.e. mutable access to prior revisions

  • Bulk operations

An improvement over the double linked list is to store the text pieces in some kind of balanced tree, where the nodes are augmented to track the size of their sub trees, thereby enabling positional queries and modifications in O(log n) time.

Full persistence, meaning non-destructive changes, preserving old versions, which can subsequently be restored and altered further, can be achieved using path copying. All tree nodes along the path from the root towards the modification point, not belonging to the same revision, are duplicated before any change takes place. Nodes would be reference counted to allow structural sharing, resource management and fast snapshots. This design also places some implementation constraints:

  • The tree structure has to be maintained without parent pointers, otherwise a single change would propagate throughout the tree, invalidating the whole data structure.

    This also implies that one-pass, top-down algorithms are beneficial. They don’t need a stack to hold the access path, error handling is potentially easier and in a concurrent setting only a fixed amount of nodes instead of a whole path needs to be locked.

  • Fewer structural changes (e.g. rotations or node splits/merges) are to be preferred, because they result in less copying. While smaller node sizes have the same effect, they also lead to worse cache locality due to irregular memory access patterns.

Marks, used to remember a certain text position, are currently implemented as simple value types. More concretely, they are denoted by a pointer representation uintptr_t which can be looked up by walking the aforementioned double linked list, resulting in O(n) complexity. Meaning they don’t need adjustments when the text is modified, but suffer from degraded performance as fragmentation increases. Improving upon that might require a more complex representation, a separately maintained lookup mechanism and explicit resource deallocation by the caller.

In preparation of a possible redesign, I separated out the higher level utility, iterator and I/O (file loading and saving) functionality, making it reusable for different core text data structure management implementations. To facilitate experimentation I wrote some additional tests, increasing test coverage. Found and fixed undefined behavior in the existing implementation and introduced rudimentary benchmarking infrastructure.