How to extract the href attribute from HTML source code

Time: 2019-09-22 23:59:29

Tags: python html web-scraping beautifulsoup

This is the HTML source I am working with:

<a href="/people/charles-adams" class="gridlist__link">

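What I want to do is use the BeautifulSoup module to extract the href attribute, which in this case is "/people/charles-adams". I need it because I want to fetch the HTML source of that specific page and run soup.findAll on it. However, I am struggling to extract this kind of attribute from the page. Can anyone help me with this?

P.S. This is how I get the HTML source with the Python module BeautifulSoup: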

import requests
from bs4 import BeautifulSoup

request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')

2 answers:

Answer 0: (score: 0)

Try something like:

refs = soup.find_all('a')
for i in refs:
    if i.has_attr('href'):
        print(i['href'])

It should output:

/people/charles-adams
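As a variation on the loop above, find_all can do the filtering itself: passing href=True matches only tags that have an href attribute, and class_ narrows the match to the class shown in the question. This is a minimal sketch; the sample markup is assumed (the link text, the closing tag, and the second anchor are not in the original question).

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question. The link text, the closing tag,
# and the second anchor (which has no href) are assumptions for illustration.
html = (
    '<a href="/people/charles-adams" class="gridlist__link">Charles Adams</a>'
    '<a name="top">no href here</a>'
)

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that carry an href attribute;
# class_ restricts the match to anchors with the gridlist__link class.
links = [a["href"] for a in soup.find_all("a", class_="gridlist__link", href=True)]
print(links)  # ['/people/charles-adams']
```

Filtering inside find_all avoids the separate has_attr check, and indexing with a["href"] is then safe because every matched tag is known to have the attribute.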

Answer 1: (score: 0)

You can tell BeautifulSoup to find all anchor tags with soup.find_all('a'). You can then use list comprehensions to filter them and get the links.

request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')

tags = soup.find_all('a')
tags = [tag for tag in tags if tag.has_attr('href')]
links = [tag['href'] for tag in tags]

links will be ['/people/charles-adams']
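Since the extracted value is a relative path, it probably needs to be turned into a full URL before it can be passed to requests.get for the follow-up page. A minimal sketch using the standard library's urljoin; the base URL below is hypothetical and stands in for the address of the page the links came from.

```python
from urllib.parse import urljoin

# Hypothetical base URL: replace it with the URL of the page that was scraped.
base_url = "https://example.com/directory"

# The relative links extracted in the previous step.
links = ["/people/charles-adams"]

# urljoin resolves each relative href against the base URL.
absolute_links = [urljoin(base_url, href) for href in links]
print(absolute_links)  # ['https://example.com/people/charles-adams']
```

Each entry in absolute_links can then be fetched and parsed the same way as the original page.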
