Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

I stumbled upon the following entry about thread safety while reading the ZeroMQ FAQ:

My multi-threaded program keeps crashing in weird places inside the ZeroMQ library. What am I doing wrong?

ZeroMQ sockets are not thread-safe. This is covered in some detail in the Guide.

The short version is that sockets should not be shared between threads. We recommend creating a dedicated socket for each thread.

For those situations where a dedicated socket per thread is infeasible, a socket may be shared if and only if each thread executes a full memory barrier before accessing the socket. Most languages support a Mutex or Spinlock which will execute the full memory barrier on your behalf.

My multi-threaded program keeps crashing in weird places inside the ZeroMQ library.
What am I doing wrong?

Here is my code:

Celluloid::ZMQ.init

module Scp
  module DataStore
    class DataSocket
      include Celluloid::ZMQ

      def pull_socket(socket)
        @read_socket = Socket::Pull.new.tap do |read_socket|
          ## IPC socket
          read_socket.connect(socket)
        end
      end

      def push_socket(socket)
        @write_socket = Socket::Push.new.tap do |write_socket|
          ## IPC socket
          write_socket.connect(socket)
        end
      end

      def run
        pull_socket and push_socket and loopify!
      end

      def loopify!
        loop {
          async.evaluate_response(read_socket.read_multipart)
        }
      end

      def evaluate_response(data)
        return_response(message_id, routing, Parser.parser(data))
      end

      def return_response(message_id, routing, object)
        data = object.to_response
        write_socket.send([message_id, routing, data])
      end
    end
  end
end

DataSocket.new.run

Now, there are a couple of things I'm unclear about:

1) Assuming that async spawns a new thread ( every time ), the write_socket is shared between all of those threads, while ZeroMQ says that its sockets are not thread-safe. So I can certainly see the write_socket running into a thread-safety issue.
( Btw, I haven't faced this issue in any end-to-end testing thus far. )

Question 1 : Is my understanding correct on this?

To solve this, ZeroMQ asks us to guard the socket with a Mutex or Semaphore.

Which leads to Question 2.

2) Context Switching.

A threaded application can context switch at any time. Looking at the ffi-rzmq code, Celluloid::ZMQ .send() internally calls send_strings(), which in turn calls send_multiple().

Question 2: Can a context switch happen anywhere, even inside the critical section here? https://github.com/chuckremes/ffi-rzmq/blob/master/lib/ffi-rzmq/socket.rb#L510

This could also lead to a data-ordering issue.

Is my observation correct?

Note:

Operating system ( MacOS, Linux and CentOS )  
Ruby - MRI 2.2.2/2.3.0


1 Answer

No one ought to risk an application's robustness by putting it on thin ice

Forgive this story for being a rather long read, but the author's life-long experience shows that the reasons why are far more important than any few SLOCs of ( potentially doubtful, mystical-looking or root-cause-ignorant ) attempts to find out experimentally how.

Initial note

ZeroMQ has for many years been promoted as a Zero-Sharing ( Zero-Blocking, ( almost ) Zero-Latency and a few more design maxims ) philosophy. The best place to read about the pros & cons are Pieter HINTJENS' books, not just the fabulous "Code Connected, Volume 1", but also the ones on advanced design & engineering in the real social domain. The very recent API documentation, however, has introduced and advertises some features that IMHO have a relaxed relation to these cornerstone principles of distributed computing and no longer insist on Zero-Sharing so loudly. This said, I still remain a Zero-Sharing guy, so kindly view the rest of this post in this light.

Answer 1:
No, sir. -- or better -- Yes and No, sir.

ZeroMQ does not ask one to use Mutex/Semaphore barriers. Doing so contradicts the ZeroMQ design maxims.

Yes, recent API changes started to mention that ( under some additional conditions ) one may start using shared sockets ... with ( many ) additional measures ... so the implication was reversed. If one "wants" to share, one also takes all the additional steps and measures, and pays all the initially hidden design & implementation costs of "allowing" shared toys to ( hopefully ) survive the principal ( unnecessary ) battle with the rest of the uncontrollable distributed-system environment. One thus suddenly also bears a risk of failing, which for many wise reasons was not the case in the initial ZeroMQ Zero-Sharing evangelisation. So the user decides which path to go. That is fair.

Sound & robust designs IMHO had still better be developed as per the initial ZeroMQ API & evangelism, where Zero-Sharing was a principle.
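One way to keep to Zero-Sharing in a multi-threaded program is to let exactly one thread own the socket and have every other thread hand its messages over through a thread-safe queue. The following is only a minimal sketch of that idea in plain Ruby. FakePushSocket and SocketOwner are hypothetical stand-ins invented for this illustration; they are not part of ffi-rzmq or Celluloid::ZMQ, and a real PUSH socket would take the place of the fake one.

```ruby
# Hypothetical stand-in for a PUSH socket: it only records what was sent.
class FakePushSocket
  attr_reader :sent

  def initialize
    @sent = []
  end

  def send_multipart(parts)
    @sent << parts
  end
end

# Hypothetical owner: only the thread started here ever touches the socket.
class SocketOwner
  STOP = Object.new # sentinel that tells the owner thread to finish

  def initialize(socket)
    @socket = socket
    @outbox = Queue.new # thread-safe FIFO from the Ruby core library
    @thread = Thread.new { drain }
  end

  # Safe to call from any thread: it only enqueues, it never touches the socket.
  def deliver(parts)
    @outbox << parts
  end

  # Processes everything queued so far, then stops the owner thread.
  def shutdown
    @outbox << STOP
    @thread.join
  end

  private

  def drain
    loop do
      parts = @outbox.pop # blocks until a message (or STOP) arrives
      break if parts.equal?(STOP)
      @socket.send_multipart(parts)
    end
  end
end

socket = FakePushSocket.new
owner  = SocketOwner.new(socket)

# Four worker threads produce responses; none of them sees the socket itself.
workers = 4.times.map do |worker_id|
  Thread.new do
    5.times do |n|
      owner.deliver(["msg-#{worker_id}-#{n}", "route-#{worker_id}", "payload"])
    end
  end
end
workers.each(&:join)
owner.shutdown
```

Each multipart message travels through the queue as one array, so its frames stay together, and messages from the same worker keep their relative order. The interleaving between different workers is left to the scheduler.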

Answer 2:
There is, by design, always a principal uncertainty about ZeroMQ data-flow ordering: one of the ZeroMQ design maxims tells designers not to rely on unsupported assumptions about message ordering, among many others ( exceptions apply ). The only certainty is that any message dispatched into the ZeroMQ infrastructure is either delivered as a complete message or not delivered at all. So one can be sure only that no fragmented wrecks ever appear on delivery. For further details, read below.


ThreadId does not prove anything ( unless inproc transport-class used )

Given the internal design of the ZeroMQ data-pumping engines, the instantiation of a zmq.Context( number_of_IO_threads ) decides how many threads get spawned for handling the future data-flows. This could be anywhere in { 0, 1: default, 2, .. }, up to almost depleting the kernel-fixed maximum number of threads. The value of 0 is a reasonable choice that avoids wasting resources where only the inproc:// transport-class is used: there the data-flow is handled as a direct memory-region mapping ( the data actually never flows and gets nailed down directly into the landing-pad of the receiving socket-abstraction :o) ), so no I/O thread is ever needed for such a job.
Next to this, <aSocket>.setsockopt( zmq.AFFINITY, <anIoThreadEnumID#> ) permits fine-tuning of the data-related IO-"hydraulics", so as to prioritise, load-balance and performance-tweak the thread-loads across the enumerated pool of the zmq.Context()-instance's IO-threads, and to gain from better settings in the design & data-flow aspects listed above.
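A detail worth spelling out: in libzmq the ZMQ_AFFINITY value is a bitmask rather than a single thread id. The lowest bit stands for I/O thread 1, the next bit for I/O thread 2, and so on, while 0 means no affinity. The helper below is a small sketch of how such a mask could be computed in Ruby. The name affinity_mask is hypothetical and not part of any binding; the resulting integer is what one would hand to the setsockopt call named above.

```ruby
# Hypothetical helper: build a ZMQ_AFFINITY bitmask from 1-based I/O thread numbers.
# Bit 0 stands for I/O thread 1, bit 1 for I/O thread 2, and so on.
# A mask of 0 means "no affinity": any I/O thread may serve the socket.
def affinity_mask(io_thread_numbers)
  io_thread_numbers.reduce(0) do |mask, number|
    raise ArgumentError, "I/O thread numbers start at 1" if number < 1
    mask | (1 << (number - 1))
  end
end

affinity_mask([1, 2]) # => 3, i.e. I/O threads 1 and 2 only
affinity_mask([3])    # => 4, i.e. I/O thread 3 only
affinity_mask([])     # => 0, i.e. no affinity
```

The affinity applies to connections created after the option is set, so it would normally be set before connect or bind.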


The cornerstone-element is the Context()'s instance,
not a Socket()'s one

Once a Context()'s instance has been instantiated and configured ( ref. above why and how ), it is ( almost ) free to be shared ( if the design cannot resist sharing or needs to avoid the setup of a fully fledged distributed-computing infrastructure ).

In other words, the brain is always inside the zmq.Context()'s instance: all the socket-related dFSA-engines are set up / configured / operated there. Yes, even though the syntax is <aSocket>.setsockopt(...), the effect is implemented inside The Brain, in the respective zmq.Context, not in some wire-from-A-to-B.

Better never share <aSocket> ( even if API-4.2.2+ promises you could )

So far, one might have seen a lot of code-snippets where a ZeroMQ Context and its sockets get instantiated and disposed of in a snap, serving just a few SLOCs in a row, but this does not mean that such practice is wise or justified by any need other than that of a very academic example ( one made to get printed in as few SLOCs as possible, because of the book publisher's policies ).

Even in such cases, a fair warning about the indeed immense costs of zmq.Context infrastructure setup / tear-down ought to be present, so as to avoid any generalisation, let alone any copy/paste replicas of code that was used as shorthand just for such illustrative purposes.

Just imagine the realistic setup work needed for any single Context instance: getting ready a pool of the respective dFSA-engines, maintaining all their respective configurations, plus all the socket-end-point pools with their transport-class specific hardware + external O/S-service handlers, round-robin event-scanners, buffer-memory-pool allocations + their dynamic allocators, etc. This all takes both time and O/S resources, so handle these ( natural ) costs wisely and with care for the overheads, if performance is not to suffer.

If still in doubt about why to mention this, just imagine somebody insisting on tearing down all the LAN cables right after a packet was sent, and then having to wait until new cabling gets installed right before the next packet needs to be sent. Hopefully this "reasonable-instantiation" view can now be better perceived as an argument to share ( if anything at all ) a zmq.Context()-instance, without any further fights over trying to share ZeroMQ socket-instances ( even if these are newly becoming ( almost ) thread-safe per se ).

The ZeroMQ philosophy is robust if taken as an advanced design evangelism for high-performance distributed-computing infrastructures. Tweaking just one ( minor ) aspect typically does not pay back all the efforts and costs: on the global view of how to design safe and performant systems, the result would not move a single bit for the better if just this one detail gets changed. Even absolutely shareable, risk-free socket-instances ( if that were ever possible ) would not change this, whereas all the benefits of sound design, clean code and reasonably achievable testability & debugging would get lost. So rather pull another wire from an existing brain to such a new thread, or equip a new thread with its own brain that will locally handle its resources and allow it to connect its own wires back to all the other brains in the distributed system, as necessary for communication.

If still in doubt, try to imagine what would happen to your national olympic hockey-team if it were sharing just one single hockey-stick during the tournament. Or how would you like it if all the neighbours in your home-town shared the same phone number to answer all the many incoming calls ( yes, with all the phones and mobiles sharing the same number ringing at the same time ). How well would that work?
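The "share the brain, not the wires" advice can be sketched as one long-lived context that is shared by all threads, with each thread lazily creating and keeping a socket of its own. The following is only an illustrative model in plain Ruby. FakeContext, FakeSocket and socket_for_current_thread are hypothetical names made up for this sketch and are not the ffi-rzmq or Celluloid::ZMQ API; a real context and real sockets would take their place.

```ruby
# Hypothetical stand-in for a socket that remembers which thread created it.
class FakeSocket
  attr_reader :owner_thread, :sent

  def initialize
    @owner_thread = Thread.current
    @sent = []
  end

  def send_multipart(parts)
    # Fail loudly if a socket ever crosses a thread boundary.
    raise "socket used from a foreign thread" unless Thread.current.equal?(@owner_thread)
    @sent << parts
  end
end

# Hypothetical stand-in for a context: created once, shared by every thread.
class FakeContext
  attr_reader :sockets

  def initialize
    @sockets = []
    @lock = Mutex.new # protects only this bookkeeping list, never a socket
  end

  def socket
    FakeSocket.new.tap { |s| @lock.synchronize { @sockets << s } }
  end
end

# Returns the calling thread's own socket, creating it on first use.
def socket_for_current_thread(context)
  current = Thread.current
  socket = current.thread_variable_get(:push_socket)
  unless socket
    socket = context.socket
    current.thread_variable_set(:push_socket, socket)
  end
  socket
end

context = FakeContext.new # the expensive part is set up exactly once

threads = 3.times.map do |i|
  Thread.new do
    2.times do |n|
      socket_for_current_thread(context).send_multipart(["worker-#{i}", "msg-#{n}"])
    end
  end
end
threads.each(&:join)
```

Three threads end up with three sockets, each used only by the thread that created it, while the context is built a single time.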


Language bindings need not reflect all the API-features available

Here, one can raise the objection, in some cases correctly, that not all ZeroMQ language-bindings or popular framework-wrappers keep all API details exposed to the user for application-level programming ( the author of this post has struggled for a long time with such legacy conflicts, which remained unresolvable for exactly this reason, and had to scratch his head a lot to find any feasible way around this fact, so it is ( almost ) always doable ).


Epilogue:

It is fair to note that recent versions of the ZeroMQ API, 4.2.2+, started to creep away from the initially evangelised principles.

Nevertheless, it is worth remembering the ancient memento mori

( emphases added, capitalisation not )

Thread safety

ØMQ has both thread safe socket type and not thread safe socket types. Applications MUST NOT use a not thread safe socket from multiple threads except after migrating a socket from one thread to another with a "full fence" memory barrier.

Following are the thread safe sockets:
* ZMQ_CLIENT
* ZMQ_SERVER
* ZMQ_DISH
* ZMQ_RADIO
* ZMQ_SCATTER
* ZMQ_GATHER

While this text might sound promising to some ears, calling barriers into service is the worst thing one can do in designing advanced distributed-computing systems, where performance is a must.

The last thing one would like to see is one's own code blocked, as such an agent gets into a principally uncontrollable blocking-state from which no one can rescue it ( neither the agent per se internally, nor anyone from outside ) in case a remote agent never delivers a just-expected event ( which in distributed systems can happen for so many reasons, or under so many circumstances, that are outside of one's control ).

Building a system that is prone to hang itself ( with the broad smile of a supported ( but naively employed ) syntax possibility ) is indeed no happy thing to do, let alone a serious design job.

One should also not be surprised that many additional ( initially not visible ) restrictions apply further down the line of the new moves into using the shared-{ hockey-stick | telephones } API:

ZMQ_CLIENT sockets are threadsafe. They do not accept the ZMQ_SNDMORE option on sends nor ZMQ_RCVMORE on receives. This limits them to single part data. The intention is to extend the API to allow scatter/gather of multi-part data.

c/a

Celluloid::ZMQ does not report any of these new-API ( a sin of sharing almost forgiving ) socket types in its section on supported socket types, so no good news is to be expected a priori. Celluloid::ZMQ master activity also seems to have faded out somewhere in 2015, so expectations ought to be somewhat realistic from this corner.

This said, one interesting point might be found behind a notice:

before you go building your own distributed Celluloid systems with Celluloid::ZMQ, be sure to give DCell a look and decide if it fits your purposes.


Last but not least, combining one event-loop system inside another event-loop is a painful job. Trying to integrate an embedded hard-real-time system into another hard-real-time system could even mathematically prove itself to be impossible.

Similarly, building multi-agent system using another agent-bas

