Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

We are implementing a pilot that reads from Kafka and writes to BigQuery.

Simple pipeline:

  • KafkaIO.read
  • BigQueryIO.write

We switched off the auto-commit. And we are using?commitOffsetsInFinalize()

Can this setup guarantee that message will appear at least once in BigQuery and will not be lost, provided that everything is ok on the BigQueryIO side?

In the documentation for commitOffsetsInFinalize() I've met the following: ?

It helps with minimizing gaps or duplicate processing of records while restarting a pipeline from scratch

I'm curious?what "gaps" here are being?referred to?

If you consider the edge cases, is there a possibility of messages being skipped and not delivered to BQ?

question from:https://stackoverflow.com/questions/65951040/apache-beam-kafkaio-at-least-once-semantics

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
1.4k views
Welcome To Ask or Share your Answers For Others

1 Answer

Committing the offset for Apache Kafka means that if you were to restart your pipeline, it would start at the position within the stream before you restarted. Dataflow does have a guarantee that data won't be dropped when writing to BigQuery. But, using a distributed system, there is always a potential for something to go wrong (for example, a GCP outage).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...