This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there's only one writer so no extra support for excluding simultaneous writers is needed.)
You don't need or want the increment of the counter variable itself to use atomic RMW operations. You can just load both halves with atomic 32-bit loads, increment it, and atomically store the result. (With cheap `relaxed` or `release` memory order, and using a `release` store for the 2nd counter update.)
Similarly, the sequence counter doesn't need to be an atomic RMW either. The writer only needs pure loads and pure stores with only release ordering, which are (much) cheaper than atomic RMWs or `seq_cst` stores:
- load the counter and the value in any order
- store a new counter (old+1)
- store the new value (or just update the low half if you want to branch on no carry)
- store the final counter.
The ordering of the stores in those last 3 bullet points is the only thing that matters. A write fence after the first store could be good, because we don't really want the cost of making the stores of both halves of the value `release`, on CPUs where that's more expensive than `relaxed`.
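To illustrate those bullet points concretely, here's a minimal untested sketch for a 64-bit value split into two 32-bit halves. (The `seq` / `lo` / `hi` names and exact fence placement are my own for illustration, not from any standard API.)

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> seq{0};        // sequence counter: even = stable, odd = write in progress
std::atomic<uint32_t> lo{0}, hi{0};  // the two halves of the 64-bit value

void writer_inc() {                  // SINGLE WRITER only
    uint32_t s = seq.load(std::memory_order_relaxed);
    uint64_t v = (uint64_t(hi.load(std::memory_order_relaxed)) << 32)
               | lo.load(std::memory_order_relaxed);      // nobody else writes, so relaxed is fine
    seq.store(s + 1, std::memory_order_relaxed);          // store a new counter (old+1)
    std::atomic_thread_fence(std::memory_order_release);  // order the odd counter store before the data stores
    ++v;
    lo.store(uint32_t(v), std::memory_order_relaxed);     // store the new value
    hi.store(uint32_t(v >> 32), std::memory_order_relaxed);
    seq.store(s + 2, std::memory_order_release);          // final counter store publishes the data
}
```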
Unfortunately, to satisfy C++ rules, the value has to be `atomic<T>`, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARM `ldp` / `stp` load-pair might not be atomic, but that doesn't matter. (And compilers often won't optimize two separate atomic 32-bit loads into one wider load.)
Values other threads read while the sequence counter is odd are irrelevant, but we'd like to avoid undefined behaviour. Maybe we could use a union of a `volatile uint64_t` and an `atomic<uint64_t>`.
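Continuing that hypothetical two-halves sketch, the reader side retries until it sees the same even count before and after loading the halves, something like:

```cpp
#include <atomic>
#include <cstdint>

extern std::atomic<uint32_t> seq, lo, hi;  // as declared in the writer sketch above

uint64_t reader_get() {
    while (true) {
        uint32_t s0 = seq.load(std::memory_order_acquire);   // could be LDAR on AArch64
        uint64_t v = (uint64_t(hi.load(std::memory_order_relaxed)) << 32)
                   | lo.load(std::memory_order_relaxed);     // may be torn; checked below
        std::atomic_thread_fence(std::memory_order_acquire); // LoadLoad: data loads before recheck
        uint32_t s1 = seq.load(std::memory_order_relaxed);
        if (s0 == s1 && !(s0 & 1))
            return v;  // even and unchanged: the two halves are a consistent snapshot
    }
}
```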
I wrote this C++ `SeqLocked<class T>` template for another question I didn't finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store). This tries to check if the target already supports lock-free atomic operations on `atomic<T>`, to stop you from using this when it's pointless. (Disable that for testing purposes by defining `IGNORE_SIZECHECK`.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using a `static_assert`.
I provided an `inc()` function for any `T` that supports a `++` operator. A TODO would be an `apply()` that accepts a lambda to do something to a `T` and stores the result between sequence-counter updates; a sketch of that is below, after the code.
```cpp
// **UNTESTED**
#include <atomic>
#ifdef UNIPROCESSOR
// all readers and writers run on the same core
// ordering instructions at compile time is all that's necessary
#define ATOMIC_FENCE std::atomic_signal_fence
#else
// A reader can be running on another core while writing
// memory barriers or ARMv8 acquire / release loads / stores are needed
#define ATOMIC_FENCE std::atomic_thread_fence
#endif
// using fences instead of .store(std::memory_order_release) will stop the compiler
// from taking advantage of a release-store instruction like AArch64 stlr.
// (On x86, plain stores already have release semantics, so the fences cost nothing extra.)
// SINGLE WRITER only.
// uses volatile + barriers for the data itself, like pre-C++11
template <class T>
class SeqLocked
{
#ifndef IGNORE_SIZECHECK
// sizeof(T) > sizeof(unsigned)
static_assert(!std::atomic<T>::is_always_lock_free, "A Seq Lock with a type small enough to be atomic on its own is totally pointless, and we don't have a specialization that replaces it with a straight wrapper for atomic<T>");
#endif
// C++17 doesn't have a good way to express a load that doesn't care about tearing
// without explicitly writing it as multiple small parts and thus gimping the compiler if it can use larger loads
volatile T data; // volatile should be fine on any implementation where pre-C++11 lockless code was possible with volatile,
// even though Data Race UB does apply to volatile variables in ISO C++11 and later.
std::atomic<unsigned> seqcount{0}; // Even means valid, odd means modification in progress.
// unsigned wraps around at a power of 2 on overflow
public:
T get() const {
unsigned c0, c1;
T tmp;
do {
c0 = seqcount.load(std::memory_order_relaxed); // or this can be a std::memory_order_acquire for multicore so AArch64 can use LDAR
ATOMIC_FENCE(std::memory_order_acquire);
tmp = (T)data; // load
ATOMIC_FENCE(std::memory_order_acquire); // LoadLoad barrier
c1 = seqcount.load(std::memory_order_relaxed);
} while(c0&1 || c0 != c1); // retry if the counter changed or is odd
return tmp;
}
// TODO: a version of this that takes a lambda for the operation on tmp
T inc() {
unsigned orig_count = seqcount.load(std::memory_order_relaxed);
seqcount.store(orig_count+1, std::memory_order_relaxed);
ATOMIC_FENCE(std::memory_order_release);
// make sure the data stores appear after the first counter update.
T tmp = data; // load
++tmp;
data = tmp; // store
ATOMIC_FENCE(std::memory_order_release);
seqcount.store(orig_count+2, std::memory_order_relaxed); // Or use mo_release here, better on AArch64
return tmp;
}
void set(T newval) {
unsigned orig_count = seqcount.load(std::memory_order_relaxed);
seqcount.store(orig_count+1, std::memory_order_relaxed);
ATOMIC_FENCE(std::memory_order_release);
// make sure the data stores appear after the first counter update.
data = newval; // store
ATOMIC_FENCE(std::memory_order_release);
seqcount.store(orig_count+2, std::memory_order_relaxed); // Or use mo_release here, better on AArch64
}
};
/***** test callers *******/
#include <stdint.h>
struct sixteenbyte {
//unsigned arr[4];
unsigned long a,b,c,d;
sixteenbyte() = default;
sixteenbyte(const volatile sixteenbyte &old)
: a(old.a), b(old.b), c(old.c), d(old.d) {}
//arr(old.arr) {}
};
void test_inc(SeqLocked<uint64_t> &obj) { obj.inc(); }
sixteenbyte test_get(SeqLocked<sixteenbyte> &obj) { return obj.get(); }
//void test_set(SeqLocked<sixteenbyte> &obj, sixteenbyte val) { obj.set(val); }
uint64_t test_get(SeqLocked<uint64_t> &obj) {
return obj.get();
}
// void atomic_inc_u64_seq_cst(std::atomic<uint64_t> &a) { ++a; }
uint64_t u64_inc_relaxed(std::atomic<uint64_t> &a) {
// same but without dmb barriers
return 1 + a.fetch_add(1, std::memory_order_relaxed);
}
uint64_t u64_load_relaxed(std::atomic<uint64_t> &a) {
// gcc uses LDREXD, not just LDRD?
return a.load(std::memory_order_relaxed);
}
void u64_store_relaxed(std::atomic<uint64_t> &a, uint64_t val) {
// gcc uses a LL/SC retry loop even for a pure store?
a.store(val, std::memory_order_relaxed);
}
```
It compiles to the asm we want: https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAM1QDsCBlZAQwBtMQBGAOgCYAOAJyCALAFYADPwDswucNIArLqVbNaoZAFJeAIR27SAZ1QBXYsg4ByAPQ2A1ACpHAVQByAFWwMvAEWdaEgCCgSG8AMx4tMisplj2WuEAwsxEALZ42uHYobkReFRYVPbuAJIACgBKAPJJ3gzVlaF29mys9sSYzFjERq206PYA7sSEmL0dprT2dPYECJj2Rsxpi2idzQ4kPVHA9lFGBMSmyAR4dH2p9mhpAA547HN4q/uXrO3zqWBWfbSYlkZlsQAJ55cJFKKLIIeaoAWVKSQA%2BgAxbBuOpLAjoEAgVKoDLIRFGPDAWhsRFUTDRTBgzCsIw04ItIIdLo9a7qewAI0Wx1otF2M2m6lQ83G1xIiyGCAeUtGZw0m3sqzSJGB3OYxFG4z6JHsQUqsIAbvxWsgAI6mPCdewOTrsZgM%2BysVDdPoOQ6S1o2v6YLDoMEQv76mHwpGo9HYTHY3HpTKI%2BadboUqmWWkDApK0zEjT2SnUvoHAhsmbFbiezoQQ4xlVqxHbcaI%2B1dBkASmGD3antucwWErusuISqoxHxc2YAGtBd0jeoCMxgItUMVmKyHQyALQVxZF46nc60UjOvATpfTIJBCwIABsIhmxHsAA9%2BNfcsElQxSm4AOIAGSjADqlSlF4lRCqwwLcFmDJ9EaqBqGcjwGBqWp4DqeZ6mK9joKkq6EAyrBUEerAnostydBuSQGAYnCcKExZ3AhiyJEkMSOn0HiJDkwRsYC9gMJg5q/qgyCngG77SPo775FQAyYMUpTfm4jTYIin4AFrYEkAASWkANKhPYRm2h6eAAF6YMuEAeO2XFLOZllUBAUzEqSfqtoZxmHKk8bseMBAQGAYDVjieIEixnHZDieBGIibBDMwwKxS6okUp0mBHjovAsgJ5r2MJokdvMrRzMCtyLEYaRtPYVJmMACBzKg3KLGFmRCvsBC6kM0wxY187vOqtyoFEBDsICR7qIMQyLOgdDfAQ9gIMwRotUs5XIHgbDmT5syfAtnS3GoAIdUVDWrocxDMCSCALSMzC3OVD40A%2BrVZEkkU5LwvAeeEUlhOEVL4FQb4ssZxktFReg0dI2GoJgRi0PNi3LatwCoBgwyJY1NWPuR8OXM6rqDHtsPw4jVgLSwNrMFyZgLcWmq7J5YMmUZQyEAgdM44dmSEBBwzyoKhCtH0aSmKwZyHRVVXvPYtyap1/TEwg2b2MAzz3LmWE3Pc7APgUJ0sNM2aLGoxCLg%2BLpusz9hwQhsr2B42G4YkugsyzLR2z5jxGJz4uDDyeaQu16jqhr7CrPQO3TNK4xkRRkP6HotGE6JY19Gg8QJX0Q2AngXKPOzxVe4hGU2%2B7FeVx7Dg1St0zzHVDW%2BLh9iVMwlglG7s3w60D380Qtvwd7iyzqMNNjfs0ylA09iJzRnBK86qTjFB77BGDIWxvimQsS5JJ/AG2RLIJaBTAQWiSRIF%2B%2BK7RktNgdfKl0tCwVt6BHhggyrOoosYAUmTR0nnLUcwBOiAlXqDKuUDoGVxaPYPebkpqXVuJcUcUxBhXFXENaaD5lz2F4O1VAK1iBUBdEMEGtxTAF0yCAG2TtFwBXbGgF%2BC0L5/Urggg%2B1wJBHmQHRX65djJOwIHcV2INK6zQSJJQR7tkASASOEXwx9zSn3oNwK26AqxYhxLWEE9ZiA9CbHSZgj53K31ZphGUGdOSB3OtokAujgT6MMe3S01pFjPWVOLM46wKpNQvFeW8HJjZOl/L4A0MiWbQjhAiFEaI6haJrJgVUeiGzEDihaK0nQfp/UiWDERPZEhKOsq2HC85zHV0Jt0cR0DolhjiZGRJOjkl1jSRktx2TzEtGEt0HpAdNTaiHOvaBfCFFKIZCoswaiNFNIcS01JBjGz2hMWYgRwz3bX2GDKdgEA5E6GvAvC%2BVFpBJG4fYIKRTricByZUzoRww7FG1lM4sD5kBLQ0H6e8rwZjoHEiEdZLM7nmHrqItZkCpE3zXuCloMJfDVBACVYhxJZh4PmL1Em85TwEzUGkLkOEMIPiwqgR6gDdqiIBU7KIyAIC2WkQCsGnDPkkBJIiVRrDFHKLZeoomszHHOKWcY0x6Ack1PdhMrl24IDMuAKy55NEjybz5W05ZQqRX0uMnU2JEYEmKvmU45VdIWyYDVeCsGLQqqniWOYRYWEyl2MlJcB6XQXpUBeb2Dx1pDgSjPuKUwtwymYAgZE4RdwxnO3Kb9O%2BDgNGROonoApYj1VGTtWGhNkbwYeiIBsKFldNXhnidgXler%2BXpObI6Y1ibTVeRPs88sWbMBStGDKtlBheAKvsUqxZpbBWrLdqzaoD4TbKlQEY9ciwFidCPDyAgbrZgBLebeUVgLMAEGBXMUF7DjLXyXXBPAgwGQBSdn8IYs5WC0s3e7RlgxpWyrPmG8VtaZm6pSfqrto6VnCsrZEh9Z862SkbSyltyd21JJfSW99qrK25tDFqgtRawMGrHSa2BDgLUVWte68N9qwG93KpqVorrxRYSoJ6ymzzfX%2BuXkGpNWGw3HtPRU1m24l1gzzQ0nVHbi2IaNchiuP61GSpvUB3QbbozNIQ2%2BlVvao32AHfAp0qpR1GsWnHKdK7Z3nkvAukQNtt1Qu3WCkGNhnAmbmPDSmbR0ImeszYMEVLYjxBYtWEa3AEBcRBhdE4C1iSPmLFSLkwJixSIvSZK93ohxiF0DpsQN8wUMpfvvT5LpcytFIFyUgyBSCHxCz5vztAAvFhpWGoozBvFQa8ngXzmB/OBYbcwr1JcHa5eq/l2rCReDXngp%2Bmj9gEXMClawdA3BmCtinQNobXJRvXHG9wZAU3NFde4MK4LenoU2AGTNgZ57VsGdycEXdxNzOIipRAXKBUxK7xGreBMdl9moC5Ioc9Rl7uKG4Cdm5q3ms1aC8WQ4iIGGncEudv0TnKt5YKzSI%2Bd2HtPaBcQaYL3uAA4%2B9ISFQQ7AHbM39g9gOhIiQu8kL7rXiy3Y6y9hVYOWsQ9tmwJ7iOcenpR2j0IVp6DXfpkdgHZ38cg%2BSKzgg7PSedZhytgFcOEcPaRyumlX7UdGYcJj16x3oiIlMNdiZrLDizNepdtnIgbtQ46yN4L7Wk66GYLfVb/P2fwOu1SiDfptdxjetb/X59DfXmN2wm2LRlgvC5KYW6HMuboFxShQZRgbbi/sIcvQrRuCUgIG8uKvyICcBA%2BJ1pkme2fsM3LqFruEy2/1xoh3mjN46751dt3Quvd0rW2rZAyB5M9zCZUbAAANXwR5aCinsIobMC02%2BxeRFHlda7mDcu6PBrPhipO59yfn/5QRMdq/19uMvTvt4u%2BrwbqM%2BzmBHkLwtRnouG/ACby3gmv5fw2AYKcu5IJCbEtrlSAlJVKE2mY%2BEUfALJ%2BStPQzzmQkznxzxN
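For the `apply()` TODO mentioned above, here's a hedged, untested sketch of what an extra member of `SeqLocked` might look like (same single-writer caveat; `f` is any callable taking and returning a `T`):

```cpp
// UNTESTED sketch: apply an arbitrary read-modify-write function to data,
// with the same counter/fence dance as inc() and set().
template <class F>
T apply(F f) {
    unsigned orig_count = seqcount.load(std::memory_order_relaxed);
    seqcount.store(orig_count+1, std::memory_order_relaxed);
    ATOMIC_FENCE(std::memory_order_release);
    // make sure the data stores appear after the first counter update.
    T tmp = data;   // load
    tmp = f(tmp);   // arbitrary update on the private copy
    data = tmp;     // store
    ATOMIC_FENCE(std::memory_order_release);
    seqcount.store(orig_count+2, std::memory_order_relaxed); // or mo_release, better on AArch64
    return tmp;
}
```

Usage would be something like `obj.apply([](uint64_t v){ return v + 10; });`.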
VWysFG1YBACsDECsFIFoBgIkHgNQBgLnjjxMHMA7h0HCE4HgIICQMgNGwnBAHCDED4H4DEH4BEAkF4DECoIkE4GkFUBgJEHgMQKsGQNIFQKsHgKMBAB4QIM4MgNIDgFgBgEQBQHxF1nGHIEoB1kHFxGIDSGAH4F4B4RIwlh1EoC5EILSyiE1GBBgLwIy3xEjgIGqFoAgj0KwCqg0HYBsPcX3BWn4OENIEwFMWQED2sBMJGjpD0JIi5EuhBCSAwB8PwNGDSGMJEJoHoCYDYA4B4AEGEHECkFkHkAUDUA0BQDjUMECP4MgFG2JTOAuBgI3F8EUmUnbzUlKE0h0n0nsA3AAmqiaI8KOFXA3GqHCEaOrCKVJFMFyKYMaLSGQEoSKXWGLEfA3GYE4DgOMDMAsESKgJgLmI4K4J4INFhEb2b34D4HsAgFwEID1BwPT1nmkMHHa1wPbHQN0HwMINbGINIIkG4GkF4DkAkGkGkAkAkBEF4E4EEHCBYKsDYNICiNeO4GvAQL0J4L4IENICEOQNGzEMkLpkoQIDkN2XOL1hUD9COOIBUDuluGiOWNgPYOhJgJONOn1ENG2PsF2N4DuOEIeNIAWG6HGEoBJJBKiN4GkG4BoO%2BIFMFIFLbTWJQJgNhMEPu