html - Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I have the following:

  html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>'''

I would like to get just the value of the href attribute, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

for a in soup.find_all('a', href=True, text=True):
    link_text = a['href']

print("Link: " + link_text)

But it prints nothing after the label, just Link:. I tested the same code on another site with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site was intentionally built to not return the href?

Thank you in advance and will be sure to upvote/accept answer!

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:36:26+0000

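The 'a' tag in your HTML does not contain text directly; it contains an 'h3' tag that holds the text, with whitespace around it. Because the 'a' tag has more than one child, its .string is None, so the text=True filter does not match and .find_all() returns an empty list. In general, avoid the text parameter when a tag contains other HTML elements besides its text content. (Newer versions of Beautiful Soup call this argument string; text is the older name for it.)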
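One way to check this for yourself, as a minimal sketch: the snippet below rebuilds a trimmed copy of the HTML from the question and inspects .string on the 'a' tag and on the nested 'h3'. It uses the string= keyword rather than text=, and the variable names (a_string, h3_string, a_text, matches) are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# <a> has three children (whitespace, <h3>, whitespace), so .string is None.
a_string = a.string

# <h3> has exactly one child, a text node, so .string is that text.
h3_string = a.h3.string

# get_text() walks all descendants, so it still finds the nested text.
a_text = a.get_text(strip=True)

# The string filter is compared against a.string, which is None: no match.
matches = soup.find_all('a', href=True, string=True)

print(repr(a_string), repr(h3_string), repr(a_text), matches)
```

The empty list in matches is the same result the loop in the question gets, which is why link_text is never assigned.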

You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
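Note that the function form returns Tag objects, not href strings, and that a.text is truthy even when a link contains only whitespace. The following is a rough sketch of a stricter variant, assuming you want to skip whitespace-only links; the sample markup and the helper name is_text_link are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one real link, one whitespace-only link, one anchor without href.
sample = '''<a href="/file-one/additional"><h3>File One</h3></a>
<a href="/blank"> </a>
<a name="no-href">Anchor only</a>'''

soup = BeautifulSoup(sample, 'html.parser')

def is_text_link(tag):
    """Match <a> tags that have an href and some non-whitespace text."""
    return (tag.name == 'a'
            and tag.has_attr('href')
            and bool(tag.get_text(strip=True)))

# find_all() accepts a function and returns the matching Tag objects.
strict = [tag['href'] for tag in soup.find_all(is_text_link)]

# For comparison: the plain a.text check keeps the whitespace-only link.
loose = [a['href'] for a in soup.find_all('a', href=True) if a.text]

print(strict)
print(loose)
```

Whether the whitespace-only link should count is a judgment call; the a.text check from the answer is fine when the markup never contains empty links.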

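If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually carry an href, but it is not required (an 'a' element can be a named anchor without one), so it is best to keep the href=True argument; otherwise a['href'] raises a KeyError for anchors that lack it.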

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]
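Since the question only needs a single href, here is a small sketch applying the same selection to the original HTML with find() and select_one() instead of building a list. The names first, first_css, and link_css are assumptions for illustration; link_text is the variable from the question.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">Down</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match or None, so guard before indexing.
first = soup.find('a', href=True)
link_text = first['href'] if first is not None else ''

# select_one() is the CSS-selector counterpart of find().
first_css = soup.select_one('a[href]')
link_css = first_css['href'] if first_css is not None else ''

print("Link: " + link_text)  # prints "Link: /file-one/additional"
```

The None guard matters on pages where no link matches; indexing None would raise a TypeError.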
