I am trying to improve the run time of my lookup-table matching.
import pandas as pd

dest_df = pd.DataFrame({"dest": ["uk LHR", "from ROM", "City:LONDON", "planetoronto", " rome rome", "junk plane"]})  ## 300,000 rows
city_df_lookup = pd.DataFrame({"places": ["london", " paris", "toronto", "rome"],
                               "code": ["LHR", "PAR", "YTO", "ROM"]})  ## around 10,000 rows
code = city_df_lookup.code.tolist()
places = city_df_lookup.places.tolist()

def select(x):
    # Return the place for the first code found in x (None if no code matches)
    for co, pl in zip(code, places):
        if co in x:
            return pl

dest_df["dest_match"] = dest_df["dest"].apply(select)
dest_df.head()
           dest dest_match
0        uk LHR     london
1      from ROM       rome
2   City:LONDON       None
3  planetoronto       None
4     rome rome       None
5    junk plane       None
Unfortunately, the code above takes too long. I would also like the loop to fall back to a string match between the city_df_lookup.places and dest_df.dest columns when no code is found.
The desired output is:
           dest dest_match
0        uk LHR     london
1      from ROM       rome
2   City:LONDON     london
3  planetoronto    toronto
4     rome rome       rome
5    junk plane   No Match
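To make the matching rule behind this table explicit, here is a minimal sketch. It assumes that codes are matched case-sensitively (as in my loop), that place names are matched case-insensitively after stripping stray whitespace, and that a code match takes precedence over a place match. The helper name select_with_fallback is hypothetical. It is still a nested loop, so it only illustrates the rule and is not a speed-up.

```python
codes = ["LHR", "PAR", "YTO", "ROM"]
places = ["london", " paris", "toronto", "rome"]
dests = ["uk LHR", "from ROM", "City:LONDON", "planetoronto", " rome rome", "junk plane"]

# Normalise place names once (the sample lookup has stray whitespace, e.g. " paris").
clean_places = [p.strip().lower() for p in places]

def select_with_fallback(text, default="No Match"):
    # First pass: codes, matched case-sensitively as in the original loop.
    for code, place in zip(codes, clean_places):
        if code in text:
            return place
    # Second pass: place names, matched case-insensitively (assumption).
    lowered = text.lower()
    for place in clean_places:
        if place in lowered:
            return place
    return default

results = [select_with_fallback(d) for d in dests]
print(results)
```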
I was thinking of using ahocorasick, but I am not sure whether there is a simpler method.
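One possibly simpler route, sketched here with only the standard library re module, is to compile all codes into one alternation pattern and all place names into a second, case-insensitive one, so each destination string is searched at most twice instead of once per lookup row. The names build_pattern and match_place are hypothetical, and the matching rules (case-sensitive codes, case-insensitive places, codes first) are assumptions read off the desired output. Whether this is fast enough for around 10,000 keys is untested, so ahocorasick may still be the better fit.

```python
import re

codes = ["LHR", "PAR", "YTO", "ROM"]
places = ["london", " paris", "toronto", "rome"]
dests = ["uk LHR", "from ROM", "City:LONDON", "planetoronto", " rome rome", "junk plane"]

clean_places = [p.strip().lower() for p in places]
code_to_place = dict(zip(codes, clean_places))

def build_pattern(keys, flags=0):
    # Longest keys first so a longer key is preferred over one of its prefixes.
    ordered = sorted(set(keys), key=len, reverse=True)
    return re.compile("|".join(re.escape(k) for k in ordered), flags)

code_re = build_pattern(codes)                         # case-sensitive
place_re = build_pattern(clean_places, re.IGNORECASE)  # case-insensitive

def match_place(text, default="No Match"):
    # Try codes first, then fall back to place names.
    m = code_re.search(text)
    if m:
        return code_to_place[m.group(0)]
    m = place_re.search(text)
    if m:
        return m.group(0).lower()
    return default

matches = [match_place(d) for d in dests]
print(matches)
```

With the DataFrames above, this could be applied as dest_df["dest_match"] = dest_df["dest"].map(match_place). One behavioural difference from my loop: the regex returns the leftmost key found in the string, whereas the loop returns the first lookup row that matches, so the two can disagree when a string contains several keys.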
Question from: https://stackoverflow.com/questions/65904049/large-scale-string-matching-between-different-dataframes-python