apache spark - What is the difference between cache and persist?

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?

1 Answer

深蓝 · Answer 1 · 2021-10-16T23:50:11+0000

cache() always uses the default storage level:

MEMORY_ONLY for an RDD
MEMORY_AND_DISK for a Dataset

persist() lets you choose the storage level explicitly, for both RDDs and Datasets. Called with no argument, it uses the same default as cache().
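To make that relationship concrete, here is a toy model in plain Python. It is only a sketch: the Level enum and the Persistable class are hypothetical stand-ins invented for illustration, not Spark's API or its implementation. The point is simply that cache() takes no level and falls back to a per-type default, while persist() accepts one.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical stand-in for a few of Spark's storage level names."""
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"
    DISK_ONLY = "DISK_ONLY"


class Persistable:
    """Toy model of something that can be persisted (not Spark code)."""

    def __init__(self, default_level):
        self.default_level = default_level
        self.level = None  # None means "not persisted yet"

    def persist(self, level=None):
        # With no argument, fall back to the default level for this type.
        self.level = self.default_level if level is None else level
        return self

    def cache(self):
        # cache() cannot take a level: it is persist() with the default.
        return self.persist()


# Defaults as stated in the answer: MEMORY_ONLY for an RDD,
# MEMORY_AND_DISK for a Dataset.
toy_rdd = Persistable(default_level=Level.MEMORY_ONLY)
toy_dataset = Persistable(default_level=Level.MEMORY_AND_DISK)

print(toy_rdd.cache().level.value)
print(toy_dataset.cache().level.value)
print(Persistable(Level.MEMORY_ONLY).persist(Level.DISK_ONLY).level.value)
```

In this model the only way to end up with a non-default level is to go through persist(), which matches the distinction drawn above.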

From the official docs:

You can mark an RDD to be persisted using the persist() or cache() methods on it.

each persisted RDD can be stored using a different storage level

The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

Use persist() if you want a storage level other than the default:

MEMORY_ONLY for an RDD
MEMORY_AND_DISK for a Dataset

The official documentation also has a useful section on which storage level to choose.
