A Tale of Two Paths: Toward a Hybrid Data Plane for Efficient Far-Memory Applications

Motivation:

  • Diverse memory access:

    • real-world applications exhibit complicated memory access patterns (e.g., sequential, skewed, and random), and access patterns depend on at least the computation model and the data model.

  • Granularity-performance tradeoff

    • Object fetching minimizes I/O amplification by fetching fine-grained objects. However, it shows no clear benefit for workloads with good locality (i.e., data on the same page are accessed close in time, so the kernel can effectively and accurately prefetch them).

  • Object eviction cost

    • Object fetching underperforms paging for programs with good locality because profiling objects and maintaining an object-based LRU for eviction come at high cost (an order of magnitude more compute cycles).

    • Moreover, if those profiling/eviction threads cannot obtain enough CPU resources, they end up scanning only a small percentage of objects before time runs out and evict objects with limited hotness information, resulting in data thrashing (i.e., hot objects get swapped out and quickly swapped back in).

  • Necessity of online profiling and path switching

    • offline profiling is not optimal:

      • access patterns can change dramatically in response to inputs

      • offline profiling cannot predict environmental changes (e.g., available CPU resources for object fetching)
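
The granularity-performance tradeoff above can be made concrete with some back-of-envelope arithmetic; the page and object sizes below are hypothetical, chosen only to illustrate the effect:

```python
# Back-of-envelope I/O amplification, with hypothetical sizes:
# fetching one 64-byte object through a 4 KB page moves 64x more data.
PAGE_SIZE = 4096
OBJECT_SIZE = 64

def paging_amplification(objects_used_per_page: int) -> float:
    """Bytes moved per useful byte when swapping whole pages."""
    return PAGE_SIZE / (objects_used_per_page * OBJECT_SIZE)

# Poor locality: one object used per fetched page -> 64x amplification.
print(paging_amplification(1))   # 64.0
# Good locality: all 64 objects on the page are used -> no amplification.
print(paging_amplification(64))  # 1.0
```

This is why neither path wins everywhere: paging wastes bandwidth under poor locality, while object fetching pays per-object overhead that good-locality workloads never amortize.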

Core Idea:

Insights:

  • continuous profiling identifies patterns on the fly for different computation stages or parallel threads accessing different data structures.

  • for programs with irregular patterns, object fetching moves objects that are accessed close in time into contiguous memory space, dynamically improving locality as the program executes.

Challenges and Solutions:

  1. How to continuously and accurately profile an application with low overhead?

  • Atlas divides a page into a set of cards, each of which is a unit for locality measurement.

  • It leverages the runtime (a read barrier) to compute a card access table (CAT) for each page, which is a bitmap where each bit corresponds to a card (i.e., consecutive 16 bytes) on the page, and a set bit means that card has been accessed since the page was allocated or last swapped in.

  • A page with a high card access rate (CAR) is deemed to possess good locality and should be accessed with paging, while a page with a low CAR has poor locality and should be accessed with object fetching.
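
The CAT/CAR mechanism above can be sketched as follows; the class and method names are illustrative, not Atlas's actual API:

```python
# Sketch of a per-page card access table (CAT), assuming 4 KB pages and
# 16-byte cards (256 cards per page).
PAGE_SIZE, CARD_SIZE = 4096, 16
NUM_CARDS = PAGE_SIZE // CARD_SIZE  # 256 bits per page

class CardAccessTable:
    def __init__(self):
        self.bits = 0  # one bit per card; set = accessed since swap-in

    def mark_access(self, offset_in_page: int):
        # Called from the read barrier on every dereference into this page.
        self.bits |= 1 << (offset_in_page // CARD_SIZE)

    def car(self) -> float:
        # Card access rate: fraction of cards touched on this page.
        return bin(self.bits).count("1") / NUM_CARDS

cat = CardAccessTable()
for off in range(0, 1024, 16):   # touch the first quarter of the page
    cat.mark_access(off)
print(cat.car())  # 0.25
```

A bitmap keeps the per-access cost to one OR instruction, which is why this profiling can run continuously inside the barrier.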

  2. How to dynamically switch access mechanisms?

  • Atlas uses a read barrier at each smart pointer dereference. The barrier quickly checks a per-page path selector flag (PSF) for the remote page to be accessed.

  • Each PSF is a 1-bit flag, set to either runtime or paging.

  • The PSF of a page is updated, based on the page's CAR, only when the page is evicted.

  • Atlas supports both object fetching and paging at ingress, but it evicts data only at the page granularity at egress, to reduce the high overhead associated with object-level profiling and LRU. While evicting pages may introduce I/O amplification for workloads with poor locality, this impact is insignificant under Atlas, because accesses in these workloads would likely go through the object fetching path, which improves locality by moving objects accessed close in time into contiguous local space. The enhanced locality effectively mitigates the negative impact of page-level eviction.

  • To reduce fragmentation resulting from dead objects, Atlas runs concurrent evacuation tasks that periodically move live objects into contiguous memory space. During each evacuation, Atlas groups recently-accessed objects into contiguous pages to improve data locality.
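
The PSF-based path selection can be sketched as below; the CAR threshold is a made-up parameter for illustration, not Atlas's actual value:

```python
# Minimal sketch of path switching: a 1-bit per-page PSF chooses the
# ingress path, and the PSF is recomputed from the page's CAR only at
# eviction time.
PAGING, RUNTIME = 1, 0
CAR_THRESHOLD = 0.5  # hypothetical threshold

def select_path(psf: int) -> str:
    # Fast check in the read barrier when the target page is remote.
    return "page-in" if psf == PAGING else "object-in"

def update_psf_on_evict(car: float) -> int:
    # High CAR -> good locality -> paging; low CAR -> object fetching.
    return PAGING if car >= CAR_THRESHOLD else RUNTIME

print(select_path(update_psf_on_evict(0.9)))  # page-in
print(select_path(update_psf_on_evict(0.1)))  # object-in
```

Updating the PSF only at eviction keeps the hot path to a single flag check, at the price of acting on slightly stale locality information.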

  3. How to synchronize the two access paths?

  • Because the kernel and the runtime are not coordinated (e.g., the kernel does not inform the runtime of the start or completion of page-fault handling), special care must be taken to prevent the two access paths from creating inconsistent data copies.

  • In particular, correctness issues may arise from a set of ingress and egress events (i.e., object-in, page-in, and page-out) that occur simultaneously.

  • Atlas provides a synchronization protocol, implemented with a combination of runtime and kernel support.

Details:

Memory Management:

  • Atlas does not handle huge objects that cannot fit into a page

Questions:

  • Atlas still needs to modify applications.

Like AIFM, Atlas requires programs to use smart pointers (i.e., to implement barriers) and declare dereference scopes for objects (inspired by C++ weak pointers and Folly RCU guards). Objects are managed by Atlas's hybrid data plane.

  • To fetch a remote object, Atlas uses a read barrier at each smart pointer dereference; the barrier quickly checks the PSF of the remote page to be accessed, and the PSF is updated based on the page's CAR only when the page is evicted. The question is: since a page's PSF and CAT reflect its access pattern before eviction, can we assume its future access pattern (after fetching) will be consistent with its historical pattern?

  • It seems that Atlas sits in a middle layer between the application and the OS? Similar to the JVM: both are runtimes!

To reduce fragmentation resulting from dead objects, Atlas runs concurrent evacuation tasks that periodically move live objects into contiguous memory space.
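
A sketch of what such an evacuation pass might do; the object layout and field names are illustrative only, not Atlas's actual data structures:

```python
# Evacuation sketch: drop dead objects and pack live, recently-accessed
# objects together into contiguous destination pages, so subsequent
# accesses to them favor the paging path.
PAGE_SIZE, OBJ_SIZE = 4096, 64
OBJS_PER_PAGE = PAGE_SIZE // OBJ_SIZE  # 64 objects per page

def evacuate(objects):
    """objects: list of (obj_id, live, last_access_time) tuples."""
    live = [o for o in objects if o[1]]
    # Group by recency: most recently accessed objects land together.
    live.sort(key=lambda o: -o[2])
    return [live[i:i + OBJS_PER_PAGE]
            for i in range(0, len(live), OBJS_PER_PAGE)]

objs = [(i, i % 3 != 0, i) for i in range(100)]  # every 3rd object is dead
pages = evacuate(objs)
print(len(pages))  # 2: 66 live objects compacted into two pages
```

Compacting by recency is what lets page-granularity eviction stay cheap: after evacuation, a page's objects tend to share a hotness level.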

  • Think about the metrics! It seems that Atlas focuses on application-level performance, like throughput and tail latency.

Our results show that Atlas enables these applications running on remote memory to achieve an overall 1.5× and 3.2× throughput improvement, compared with AIFM and Fastswap, respectively. Atlas reduces the tail latency by one and two orders of magnitude when compared with AIFM and Fastswap.

  • For ingress, it leverages hardware transactional memory?

Atlas leverages hardware transaction memory and, in particular, Intel’s TSX, to run a quick check—Atlas accesses the address in a hardware transaction, which aborts if the address is not on a mapped page.

  • For egress, the single path of paging to swap out data eliminates the need to maintain an object-level LRU. But is this paging controlled by the runtime? What about the kernel's memory management for pages?

    • Atlas only has page-out, not object-out!

    • Does Atlas change the page-out logic of the kernel?

    • Does Atlas pin memory and manage it by itself? Seemingly not: it says that "any page with a non-zero deref count is skipped when the kernel looks for swap-out victims".

      • A natural follow-up question is: is this related to any performance metric, like the memory requirement of the runtime? For example, if many pages have live objects, might we need more local memory?

      • Another question is: how can this runtime alleviate the memory-thrashing problem for pages? After GC/evacuation, we might want to reclaim pages with a zero deref count at higher priority. Does this require cooperation with the kernel?

  • Significant modification is required, and it needs careful programming

Atlas uses an approach that is similar to Semeru. Developers need to explicitly define remoteable data structures and functions. We modify the paging system to ensure that a page at a virtual address A on the compute server is guaranteed to be still at address A on the remote server when evicted. Atlas requires users to guarantee that a remotable data structure cannot reference a non-remotable object.

  • Diff between Atlas and our current idea?

    • Atlas focuses on hybrid data paths leveraging extra runtime information (like the card access table and path selector flag). However, it is mainly about runtime implementation and requires application modification. Their runtime coexists with the kernel's existing mechanisms, but it is not aimed at reducing the effort of modifying existing application code, nor at finely controlling the kernel-level page cache by injecting hints.

    • Can we somehow use Atlas's idea? Try to regard Atlas just as a normal runtime (e.g., the JVM), and pay more attention to the kernel side while utilizing the online-profiling and object-level-fetching ideas in disaggregated memory?

      • First of all, figure out how Atlas cooperates with the kernel's page-level memory management!

      • Can the information produced by evacuation be used by the kernel?

========

  • Why can't a flag be used to represent local/remote? Why TSX?

In AIFM, this is done by using a bit in each pointer to encode the location of the referenced object—these pointers are updated once the objects they point to are swapped in or out. Atlas, however, cannot adopt this approach due to the use of the hybrid data plane—when data is paged out, Atlas cannot update any pointers. To solve the problem without incurring the cost of checking with the kernel at every read, Atlas leverages hardware transaction memory and, in particular, Intel’s TSX, to run a quick check—Atlas accesses the address in a hardware transaction, which aborts if the address is not on a mapped page.
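
The contrast above can be modeled in a toy sketch. Everything here (the `mapped_pages` set, function names) is illustrative stand-in logic; real Atlas touches the address inside a hardware transaction and treats an abort as "page not mapped":

```python
# Toy model: with a hybrid data plane, the kernel may page data out
# without the runtime updating any pointers, so an AIFM-style location
# bit embedded in the pointer can go stale.
mapped_pages = {0x1000, 0x2000}

def tag_says_local(ptr_tag: int) -> bool:
    # AIFM-style: a pointer bit updated when the object swaps in/out.
    return ptr_tag == 1

def tsx_style_check(addr: int) -> bool:
    # Stand-in for "access the address inside a hardware transaction";
    # here we consult the page set, mimicking abort-on-unmapped-page.
    return (addr & ~0xFFF) in mapped_pages

# The kernel pages out 0x2000 behind the runtime's back:
mapped_pages.discard(0x2000)
print(tag_says_local(1))        # True  -- stale tag still claims local
print(tsx_style_check(0x2040))  # False -- the actual check catches it
```

The point is that the TSX check consults ground truth (the page table) at access time, while a pointer tag would need an update hook the kernel's page-out path does not provide.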

  • Atlas reduces the I/O amplification of page-in by using object-in. Consider these scenarios:

    • A remote page whose PSF is runtime, but we want to fetch many objects on that page.

    • A remote page whose PSF is paging, but we only want one object on that page. Hence, we still have I/O amplification!

  • workflow:

    • check whether the page is local or remote; if remote, check the PSF (runtime or paging)

    • Sync invariants:

      • object-in vs. page-in (PSF flag)

      • object-in vs. page-out (per-page deref count)

      • object-in vs. evacuation (a non-zero deref count prevents the page from being evacuated)

Background Knowledge

Slides: https://www.usenix.org/system/files/osdi24_slides_chen-lei.pdf
