We are implementing a pilot that reads from Kafka and writes to BigQuery.
Simple pipeline:
- KafkaIO.read
- BigQueryIO.write
We switched off the auto-commit.
And we are using commitOffsetsInFinalize().
Can this setup guarantee that every message will appear at least once in BigQuery and will not be lost, provided that everything is OK on the BigQueryIO side?
In the documentation for commitOffsetsInFinalize() I came across the following:
"It helps with minimizing gaps or duplicate processing of records while restarting a pipeline from scratch"
I'm curious: what "gaps" are being referred to here?
If you consider the edge cases, is there a possibility of messages being skipped and not delivered to BQ?
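To make the edge case concrete, here is a toy model in plain Java of what I understand by a "gap" versus a duplicate. It does not use Kafka or Beam at all, and OffsetCommitModel, run, and its parameters are names I made up for illustration. It only encodes my assumption about the order of "commit offset" and "write to sink" around a single crash. It is not a description of how KafkaIO or BigQueryIO actually behave.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: no Kafka, no Beam. All names here are hypothetical.
class OffsetCommitModel {

    /**
     * Processes offsets 0..total-1 with exactly one crash, then restarts
     * from the last committed offset.
     *
     * @param total       number of records in the (single) partition
     * @param crashAt     the crash happens while handling this offset,
     *                    between the first and the second step
     * @param commitFirst true:  step 1 = commit offset, step 2 = write to sink
     *                    false: step 1 = write to sink, step 2 = commit offset
     * @return the offsets that reached the sink, in write order
     */
    static List<Integer> run(int total, int crashAt, boolean commitFirst) {
        List<Integer> sink = new ArrayList<>();
        int committed = 0;       // next offset to read after a restart
        boolean crashed = false; // the model crashes only once
        int offset = 0;

        while (offset < total) {
            boolean crashNow = !crashed && offset == crashAt;
            if (commitFirst) {
                committed = offset + 1;          // step 1: commit
                if (crashNow) {                  // crash before the write
                    crashed = true;
                    offset = committed;          // restart from committed offset
                    continue;
                }
                sink.add(offset);                // step 2: write
            } else {
                sink.add(offset);                // step 1: write
                if (crashNow) {                  // crash before the commit
                    crashed = true;
                    offset = committed;          // restart from committed offset
                    continue;
                }
                committed = offset + 1;          // step 2: commit
            }
            offset++;
        }
        return sink;
    }

    public static void main(String[] args) {
        // Commit before write: offset 2 is never written (a gap).
        System.out.println(run(5, 2, true));   // prints [0, 1, 3, 4]
        // Write before commit: offset 2 is written twice (a duplicate).
        System.out.println(run(5, 2, false));  // prints [0, 1, 2, 2, 3, 4]
    }
}
```

In this model, committing before the write loses offset 2, while committing after the write only duplicates it. My question is whether the real pipeline with commitOffsetsInFinalize() can ever end up in the first situation.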
Question from: https://stackoverflow.com/questions/65951040/apache-beam-kafkaio-at-least-once-semantics