菜鸟笔记
提升您的技术认知

Beautiful Soup关联选择的方法

阅读 : 92

Beautiful Soup是Python的一个网页解析库,处理快捷; 支持多种解析器,功能强大。教程细致讲解Beautiful Soup的深入使用、节点选择器、CSS选择器、Beautiful Soup4的方法选择器等重要知识点,是学好爬虫的基础课程。

节点选择器——关联选择的方法

学习目标

  1. 掌握关联选择的方法的使用

1. 关联选择

在做选择的时候,有时候不能做到一步就选到想要的节点元素,例如示例中的第二个a节点

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

需要先选中某一个节点元素,然后以它为基准再选择它的子节点、父节点、兄弟节点等,接下来我们来介绍如何选择这些节点元素。

1. 子节点

  • 格式:soup.tag.contents

  • 返回值:列表

  • 示例:

    html = '''
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        
        <div class="panel-body">
            <ul class="list" id="list-1">
                <li class="element"> Foo</li>
                <li class="element">Bar</li>
                <li class="element">]ay</li>
            </ul>
            
            <ul class="list list-small" id="list-2">
                <li class="element"> Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    # 获取p节点的子节点
    print(soup.p.contents)
      
    # 输出结果
    ['Once upon a time there were three little sisters; and their names were\n', <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>, ',\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';\nand they lived at the bottom of a well.']
    
  • 格式:soup.tag.children

  • 返回值:生成器

  • 示例:

    html = '''
      <div class="panel">
          <div class="panel-heading">
              <h4>Hello</h4>
          </div>
          
          <div class="panel-body">
              <ul class="list" id="list-1">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
                  <li class="element">]ay</li>
              </ul>
              
              <ul class="list list-small" id="list-2">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
              </ul>
          </div>
      </div>
      '''
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    # 获取p节点的每一个子节点
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
      	print(i, child)
        
    # 输出结果
    <list_iterator object at 0x105e6c128>
    0 Once upon a time there were three little sisters; and their names were
    
    1 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
    2 ,
    
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4  and
    
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 ;
    and they lived at the bottom of a well.
    

2. 子孙节点

上面我们已经拿到了p节点的全部直系的子节点,如果我们想要获取p节点中的所有子孙节点的话,可以使用descendants属性。

  • 格式:soup.p.descendants

  • 返回值:生成器

  • 示例:

    html = '''
      <div class="panel">
          <div class="panel-heading">
              <h4>Hello</h4>
          </div>
          
          <div class="panel-body">
              <ul class="list" id="list-1">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
                  <li class="element">]ay</li>
              </ul>
              
              <ul class="list list-small" id="list-2">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
              </ul>
          </div>
      </div>
      '''
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    # 获取p节点所有的子孙节点
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
    
    # 输出结果
    <generator object descendants at 0x102334468>
    0 Once upon a time there were three little sisters; and their names were
    
    1 <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
    2 <span>Elsie</span>
    3 Elsie
    4 ,
    
    5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    6 Lacie
    7  and
    
    8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    9 Tillie
    10 ;
    and they lived at the bottom of a well.
    

3. 父节点

上面我们都是在选择子节点和子孙节点,接下来我们使用parent属性获取某节点元素的父节点。

  • 格式:soup.tag.parent

  • 返回值:节点元素

  • 示例:

    html = '''
      <div class="panel">
          <div class="panel-heading">
              <h4>Hello</h4>
          </div>
          
          <div class="panel-body">
              <ul class="list" id="list-1">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
                  <li class="element">]ay</li>
              </ul>
              
              <ul class="list list-small" id="list-2">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
              </ul>
          </div>
      </div>
      '''
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    # 打印选取的a节点
    print(soup.a)
    # 获取a节点的父节点
    print(soup.a.parent)
    
    # 输出结果
    # 选取的a节点
    <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>
    # 所选a节点的父节点
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    

4. 祖先节点

如果想要获取,祖先节点,可以调用parents属性。

  • 格式:soup.tag.parents

  • 返回值:生成器

  • 示例:

    html = '''
      <div class="panel">
          <div class="panel-heading">
              <h4>Hello</h4>
          </div>
          
          <div class="panel-body">
              <ul class="list" id="list-1">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
                  <li class="element">]ay</li>
              </ul>
              
              <ul class="list list-small" id="list-2">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
              </ul>
          </div>
      </div>
      '''
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    # 获取a节点的所有祖先节点
    print(soup.a.parents)
    # 打印a节点的所有祖先节点的类型
    print(type(soup.a.parents))
    # 获取a节点的所有祖先节点的内容
    print(list(enumerate(soup.a.parents)))
    
    # 输出结果
    <generator object parents at 0x102434468>
    <class 'generator'>
    [(0, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>), (1, <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>), 
     (2, <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>), 
     (3, <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body></html>)]
    

5. 兄弟节点

上面说明了子节点和父节点的获取方式,那如果想要获取同级的节点,应该怎么办呢?接下来我们来学习下,使用sibling获取兄弟节点。

  • 获取后面一个节点

  • 格式:soup.tag.next_sibling

  • 返回值:节点元素

  • 示例:

    html = '''
      <div class="panel">
          <div class="panel-heading">
              <h4>Hello</h4>
          </div>
          
          <div class="panel-body">
              <ul class="list" id="list-1">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
                  <li class="element">]ay</li>
              </ul>
              
              <ul class="list list-small" id="list-2">
                  <li class="element"> Foo</li>
                  <li class="element">Bar</li>
              </ul>
          </div>
      </div>
      '''
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'lxml')
    # 获取a节点的后面一个节点
    print(soup.a.next_sibling)
    # 获取类型
    print(type(soup.a.next_sibling))
    
    # 输出结果
    ,
    
    <class 'bs4.element.NavigableString'>
    
  • 获取后面所有的节点

  • 格式:soup.tag.next_siblings

  • 返回值:生成器

  • 示例:

    # 获取a节点的后面所有节点
    print(soup.a.next_siblings)
    # 获取类型
    print(type(soup.a.next_siblings))
    # 获取所有内容
    print(list(enumerate(soup.a.next_siblings)))
    
    # 输出结果
    <generator object next_siblings at 0x102434468>
    <class 'generator'>
    [(0, ',\n'), 
     (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), 
     (2, ' and\n'), 
     (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), 
     (4, ';\nand they lived at the bottom of a well.')]
    
  • 获取前面一个节点

  • 格式:soup.tag.previous_sibling

  • 返回值:节点元素

  • 示例:

    # 获取a节点的前一个节点
    print(soup.a.previous_sibling)
    # 获取类型
    print(type(soup.a.previous_sibling))
    
    # 输出结果
    Once upon a time there were three little sisters; and their names were
    
    <class 'bs4.element.NavigableString'>
    
  • 获取前面的所有节点

  • 格式:soup.tag.previous_siblings

  • 返回值:生成器

  • 示例:

    # 获取a节点的前面所有节点
    print(soup.a.previous_siblings)
    # 获取类型
    print(type(soup.a.previous_siblings))
    # 获取所有内容
    print(list(enumerate(soup.a.previous_siblings)))
    
    # 输出结果
    <generator object previous_siblings at 0x102334468>
    <class 'generator'>
    [(0, 'Once upon a time there were three little sisters; and their names were\n')]
    

2. 总结

节点选择器 关联选择方法:

  • 子节点
    • soup.tag.contents
    • soup.tag.children
  • 子孙节点
    • soup.tag.descendants
  • 父节点
    • soup.tag.parent
  • 祖先节点
    • soup.tag.parents
  • 兄弟节点
    • soup.tag.next_sibling
    • soup.tag.next_siblings
    • soup.tag…previous_sibling
    • soup.tag…previous_siblings