Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

In terms of RDD persistence, what are the differences between cache() and persist() in Spark?


Fight the dragon too long and you become the dragon yourself; gaze too long into the abyss and the abyss will gaze back…

1 Answer

With cache(), you can use only the default storage level:

  • MEMORY_ONLY for RDD
  • MEMORY_AND_DISK for Dataset

With persist(), you can specify which storage level you want for both RDD and Dataset.
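The relationship described above can be sketched with a small toy model. This is plain Python, not Spark code: `ToyStorageLevel`, `ToyPersistable`, `toy_rdd` and `toy_dataset` are hypothetical names invented for illustration, and only the two level names come from the answer. It assumes nothing about Spark's internals beyond "cache() means persist() with the default level".

```python
from enum import Enum


class ToyStorageLevel(Enum):
    """Stand-in for two storage levels named in the answer (not Spark's class)."""
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"


class ToyPersistable:
    """Hypothetical model of a persistable collection; not a Spark class."""

    def __init__(self, default_level):
        self.default_level = default_level
        self.storage_level = None  # None means "not persisted yet"

    def persist(self, level=None):
        # The caller may pick any level; otherwise fall back to the default.
        self.storage_level = level if level is not None else self.default_level
        return self

    def cache(self):
        # cache() accepts no level: it is persist() with the default.
        return self.persist()


def toy_rdd():
    # Per the answer, the default for an RDD is MEMORY_ONLY.
    return ToyPersistable(ToyStorageLevel.MEMORY_ONLY)


def toy_dataset():
    # Per the answer, the default for a Dataset is MEMORY_AND_DISK.
    return ToyPersistable(ToyStorageLevel.MEMORY_AND_DISK)


if __name__ == "__main__":
    print(toy_rdd().cache().storage_level.name)
    print(toy_dataset().cache().storage_level.name)
    print(toy_rdd().persist(ToyStorageLevel.MEMORY_AND_DISK).storage_level.name)
```

In real PySpark code the corresponding calls would be rdd.cache() and rdd.persist(StorageLevel.MEMORY_AND_DISK), with StorageLevel imported from pyspark.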

From the official docs:

  • You can mark an RDD to be persisted using the persist() or cache() methods on it.
  • Each persisted RDD can be stored using a different storage level.
  • The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

Use persist() if you want to assign a storage level other than the default:

  • MEMORY_ONLY for an RDD
  • MEMORY_AND_DISK for a Dataset

See also the section of the official documentation on which storage level to choose.

