python - Why does linkextractor skip link?

Question

asked Jan 27, 2021 in Technique by 深蓝 (71.8m points)

I am scraping some pages and am trying to use the LinkExtractor to get the URLs from the response. In general this works well, but the LinkExtractor does not extract the relative link to a PDF file found at line 111 of the HTML.

I have tried a lot, but haven't been able to figure out how and when the LinkExtractor drops relative links, or whether this is how the extractor is supposed to work.

scrapy shell "https://www.limburg.nl/algemeen/zoeken/?mode=zoek&ajax=true&zoeken_sortering=Num&pager_page=4"
from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS
skip_extensions = list(set(IGNORED_EXTENSIONS) - set("pdf")) + ["gz","txt","csv","cls","and","bib","xml","dat","dpr","cfg","bdsproj","dproj","local","tvsconfig","res","dsk"]
extractor = LinkExtractor(deny_extensions=skip_extensions)
extractor.extract_links(response)

1 Answer

深蓝 · Answer 1 · 2021-01-26T20:36:13+0000

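set() takes an iterable as an argument and makes a set of each item in it. A string iterates over its individual characters, so set("pdf") makes a set of the characters "p", "d", and "f". Subtracting that set does not remove "pdf" from IGNORED_EXTENSIONS, so "pdf" stays in deny_extensions and the link to the PDF file is skipped.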
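A quick way to see this outside of Scrapy is the sketch below. The `ignored` list is a small hypothetical stand-in for Scrapy's IGNORED_EXTENSIONS, not the real list.

```python
# A str is iterated character by character, so set() receives three items.
chars = set("pdf")
print(chars == {"p", "d", "f"})  # True

# Hypothetical stand-in for scrapy's IGNORED_EXTENSIONS (assumed subset).
ignored = ["pdf", "zip", "mp3"]

# Subtracting the single characters leaves the "pdf" entry untouched.
remaining = set(ignored) - set("pdf")
print("pdf" in remaining)  # True
```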

If you want the whole string "pdf" in the set, then you need to enclose it in a list:

set(["pdf"])

Or it might be simpler to use {} notation instead of calling set():

{"pdf"}
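Applied to the list from the question, the fix might look like the sketch below. `build_deny_extensions` is a hypothetical helper name, and `ignored` is again a small stand-in for Scrapy's IGNORED_EXTENSIONS so that the snippet runs without Scrapy installed.

```python
def build_deny_extensions(ignored, keep, extra):
    """Return ignored minus keep, plus extra, as a sorted list."""
    return sorted((set(ignored) - set(keep)) | set(extra))

# Hypothetical stand-in for scrapy's IGNORED_EXTENSIONS (assumed subset).
ignored = ["pdf", "zip", "mp3"]

# keep is a set holding the whole string "pdf", not its characters.
skip_extensions = build_deny_extensions(
    ignored, keep={"pdf"}, extra=["gz", "txt", "csv"]
)
print(skip_extensions)  # ['csv', 'gz', 'mp3', 'txt', 'zip']
```

With Scrapy available, the same helper could be given the real IGNORED_EXTENSIONS, and the result passed as LinkExtractor(deny_extensions=skip_extensions).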
