We have ETL jobs written in Python (Luigi). They all connect to the Hive Metastore to get partition information.
Code:
from hive_metastore import ThriftHiveMetastore
client = ThriftHiveMetastore.Client(protocol)
partitions = client.get_partition_names('sales', 'salesdetail', -1)
-1 is max_parts (the maximum number of partitions to return; -1 means no limit).
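The construction of `protocol` is not shown above. As a minimal sketch, assuming the standard Thrift Python classes, it could be built as follows. The host name `metastore-host`, the port 9083, the 60-second timeout, and the helper names `to_millis` and `open_metastore_client` are illustrative assumptions, and the actual setup inside luigi.contrib.hive may differ.

```python
def to_millis(seconds):
    """Convert seconds to the milliseconds that TSocket.setTimeout expects."""
    return int(seconds * 1000)


def open_metastore_client(host="metastore-host", port=9083, timeout_s=60):
    """Open a metastore client with an explicit client-side socket timeout.

    host, port and timeout_s are placeholder values, not taken from the job.
    """
    # Imported lazily so to_millis can be used where thrift is not installed.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hive_metastore import ThriftHiveMetastore

    sock = TSocket.TSocket(host, port)
    sock.setTimeout(to_millis(timeout_s))  # setTimeout takes milliseconds
    transport = TTransport.TBufferedTransport(sock)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ThriftHiveMetastore.Client(protocol)
    transport.open()
    # Return the transport too, so the caller can close it when done.
    return client, transport
```

The point of the sketch is that the client-side timeout is set on the socket in milliseconds, independently of any timeout configured on the metastore itself.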
It randomly times out like this:
File "/opt/conda/envs/etl/lib/python2.7/site-packages/luigi/contrib/hive.py", line 210, in _existing_partitions
partition_strings = client.get_partition_names(database, table, -1)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1703, in get_partition_names
return self.recv_get_partition_names()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1716, in recv_get_partition_names
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
buff = self.trans.readAll(4)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz - have)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 159, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 105, in read
buff = self.handle.recv(sz)
timeout: timed out
This error happens occasionally.
There is a 15-minute timeout on the Hive Metastore.
When I run get_partition_names on its own to investigate, it returns data within a few seconds.
Even when I set the socket timeout to 1 or 2 seconds, the query completes.
There is no record of a socket-close or connection-closed message in the Hive Metastore logs (cat /var/log/hive/..log.out).
The tables it usually times out on have a large number of partitions (roughly 10K or more). As mentioned above, though, they only time out intermittently, and they return partition metadata quickly when that portion of the code is tested on its own.
Any ideas why it times out randomly, how to catch these timeout errors in the metastore logs, or how to fix them?
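To at least record on the client side when these calls hang and for how long, one option is a small wrapper like the sketch below. `call_with_retry`, its parameters, and the logger name are hypothetical names for illustration. It assumes the timeout surfaces as `socket.timeout`, as in the traceback above; other Thrift versions may raise a different exception type.

```python
import logging
import socket
import time

log = logging.getLogger("metastore_retry")


def call_with_retry(fn, attempts=3, backoff_s=1.0, sleep=time.sleep):
    """Call fn(); on socket.timeout, log the elapsed time and retry.

    fn should open a fresh connection on each call, because a transport
    that timed out mid-read may be left in an unusable state.
    """
    for attempt in range(1, attempts + 1):
        start = time.time()
        try:
            return fn()
        except socket.timeout:
            elapsed = time.time() - start
            log.warning(
                "metastore call timed out after %.1fs (attempt %d/%d)",
                elapsed, attempt, attempts,
            )
            if attempt == attempts:
                raise  # out of attempts: surface the original timeout
            sleep(backoff_s * attempt)  # linear backoff between attempts
```

Here `fn` would be a function that connects to the metastore, calls `get_partition_names`, and closes the connection. The logged timestamps could then be compared against the metastore logs for the same period.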
Question from: https://stackoverflow.com/questions/65898409/python-hive-metastore-partition-timeout