Question

I have a page whose structure looks like this:

<body>
    <article>  <!--article no 1-->
        <h3>
        <h2>
            <h1>
                <a>  <!--first 'a' tag-->

        <article> <!--article no 2-->
            <h1>
            <h2>
                <a>  <!--second 'a' tag-->
        </article>       
    </article>
</body>

What I want to extract is all the 'a' tags inside an article, but without any 'a' tags that come from a nested article.

That is, given:

articles = browser.find_elements_by_tag_name("article")
for i in articles:
    print(i.find_elements_by_tag_name("a"))

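For the first article, i.find_elements returns every 'a' tag inside that article tag. That includes the 'a' tags inside an article tag that is itself nested in the current article tag, and I don't want those.

If I call find_elements on article 1, I don't want the 'a' tags from article 2 or from any other nested article.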

Answer 1

If you want the links from non-nested articles only, try:

int rowIdx = 0;
dtDestination.AsEnumerable().All(row => { row["colName"] = dtSource.Rows[rowIdx++]["colName"]; return true; });
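
A minimal sketch of the "own links only" idea, using just the Python standard library rather than Selenium: walk each article's children and stop descending whenever a nested article is reached. The sample markup, the id and href values, and the helper name own_links are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page in the question (ids and hrefs are made up).
PAGE = """
<body>
  <article id="1">
    <h1><a href="/first">first</a></h1>
    <article id="2">
      <h1><a href="/second">second</a></h1>
    </article>
  </article>
</body>
"""


def own_links(article):
    """Return the <a> elements inside `article`, skipping nested <article> subtrees."""
    found = []
    for child in article:
        if child.tag == "article":
            continue  # links in here belong to the nested article, not this one
        if child.tag == "a":
            found.append(child)
        found.extend(own_links(child))
    return found


root = ET.fromstring(PAGE)

# Map every article (outer and nested) to the hrefs of its own links.
links_by_article = {
    art.get("id"): [a.get("href") for a in own_links(art)]
    for art in root.iter("article")
}
print(links_by_article)
```

Each article is handled independently, so the nested article still reports its own link while the outer one no longer picks it up.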

Answer 2

Parse each article element with BeautifulSoup and you can easily get all the anchor tags.

from bs4 import BeautifulSoup
articles = browser.find_elements_by_tag_name("article")
links = []
for i in articles:
    soup = BeautifulSoup(i.get_attribute('outerHTML'), 'html5lib')
    a_tags = soup.findAll('a')
    links.extend(a_tags)

Hope this helps! Cheers!
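
Note that collecting every anchor from an article's outerHTML still includes the anchors of any nested article. One possible way to filter them out, sketched here with the standard library's html.parser instead of BeautifulSoup, is to track how many article tags are currently open while parsing. The class name OwnLinkCollector and the sample outerHTML string are assumptions made for this example.

```python
from html.parser import HTMLParser


class OwnLinkCollector(HTMLParser):
    """Collect href values of <a> tags that sit directly in the outermost <article>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <article> elements currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif tag == "a" and self.depth == 1:
            # Only anchors seen while exactly one article is open are kept.
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "article":
            self.depth -= 1


# Hypothetical outerHTML of the outer article from the question.
outer_html = (
    '<article><h1><a href="/first">first</a></h1>'
    '<article><h1><a href="/second">second</a></h1></article>'
    '</article>'
)

collector = OwnLinkCollector()
collector.feed(outer_html)
collector.close()
own_hrefs = collector.hrefs
print(own_hrefs)
```

In the Selenium loop above, outer_html would be the string returned by i.get_attribute('outerHTML').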

Answer 3

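Using BeautifulSoup,

try finding all the <a> tags under an <article>, e.g. with the selector ('article a').

Then use BeautifulSoup's find_parents() method.

If find_parents('article') on one of those <a> tags returns 2 or more elements, the tag is probably nested like this: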

<article>
  ..
 <article>
    ..
    <a>

So if you drop those, you are left with the <a> tags that have only one <article> parent.

all_a = soup.select('article a')

direct_a = [i for i in all_a if len(i.find_parents('article')) == 1]
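
The same ancestor-counting idea can be sketched without BeautifulSoup, assuming well-formed markup that xml.etree.ElementTree can parse. ElementTree has no parent pointers, so the sketch builds a child-to-parent map first. The sample markup and the helper name article_depth are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the nested structure (hrefs are made up).
PAGE = """
<body>
  <article>
    <h1><a href="/first">first</a></h1>
    <article>
      <h1><a href="/second">second</a></h1>
    </article>
  </article>
</body>
"""

root = ET.fromstring(PAGE)

# Child -> parent lookup, since Element objects do not store their parent.
parent_of = {child: parent for parent in root.iter() for child in parent}


def article_depth(node):
    """Count how many <article> ancestors `node` has."""
    depth = 0
    while node in parent_of:
        node = parent_of[node]
        if node.tag == "article":
            depth += 1
    return depth


# Keep only anchors with exactly one <article> ancestor.
direct_hrefs = [a.get("href") for a in root.iter("a") if article_depth(a) == 1]
print(direct_hrefs)
```

This keeps the links of top-level articles only; links that belong to a nested article have two or more article ancestors and are filtered out.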

Grabbing only a specific tag, without the details from tags nested inside that specific tag
