python - How to download and save all PDF from a dynamic web?

Question

Welcome To Ask or Share your Answers For Others

python - How to download and save all PDF from a dynamic web?

asked Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

I am trying to download and save in a folder all the PDFs contained in some webs with dynamic elements i.e: https://www.bankinter.com/banca/nav/documentos-datos-fundamentales

Every PDF in this url have similar href. Here they are two of them: "https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc=workspace://SpacesStore/fb029023-dd29-47d5-8927-31021d834757;1.0&nameDoc=ISIN_ES0213679FW7_41-Bonos_EstructuradosGarantizad_19.16_es.pdf"

"https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc=workspace://SpacesStore/852a7524-f21c-45e8-a8d9-1a75ce0f8286;1.1&nameDoc=20-Dep.Estruc.Cont.Financieros_18.1_es.pdf"

Here it is what I did for another web, this code is working as desired:

link = 'https://www.bankia.es/estaticos/documentosPRIIPS/json/jsonSimple.txt'
base = 'https://www.bankia.es/estaticos/documentosPRIIPS/{}'

dirf = os.environ['USERPROFILE'] + "DocumentsTFMPdfFolder"
if not os.path.exists(dirf2):os.makedirs(dirf2)
os.chdir(dirf2)

res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
for item in res.json():
    if not 'nombre_de_fichero' in item: continue
    link = base.format(item['nombre_de_fichero'])
    filename_bankia = item['nombre_de_fichero'].split('.')[-2] + ".PDF"
    with open(filename_bankia, 'wb') as f:
        f.write(requests.get(link).content)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

9.1k views

1 Answer

深蓝 · Answer 1 · 2022-01-31T07:26:41+0000

You have to make a post http requests with appropriate json parameter. Once you get the response, you have to parse two fields objectId and nombreFichero to use them to build right links to the pdf's. The following should work:

import os
import json
import requests

url = 'https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list'
base = 'https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}'
payload = {"cod_categoria": 2,"cod_familia": 3,"divisaDestino": None,"vencimiento": None,"edadActuarial": None}

dirf = os.environ['USERPROFILE'] + "DesktopPdfFolder"
if not os.path.exists(dirf):os.makedirs(dirf)
os.chdir(dirf)

r = requests.post(url,json=payload)
for item in r.json():
    objectId = item['objectId']
    nombreFichero = item['nombreFichero'].replace(" ","_")
    filename = nombreFichero.split('.')[-2] + ".PDF"
    link = base.format(objectId,nombreFichero)
    with open(filename, 'wb') as f:
        f.write(requests.get(link).content)

After executing the above script, wait a little for it to work as the site is real slow.

Categories

python - How to download and save all PDF from a dynamic web?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags