Web Crawler Tech Stack

💡

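A web crawler is a tool that fetches web page data automatically. In modern web development, Selenium and Playwright are the two most widely used browser automation frameworks. This document walks through the core APIs of both and compares them side by side, so you can pick the right tool for your scenario.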

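1. Selenium

1.1 Overview

Selenium is one of the earliest browser automation frameworks. It supports nearly every major browser (Chrome, Firefox, Safari, Edge, and others) and most mainstream programming languages (Python, Java, C#, JavaScript, Ruby, and others), which makes it the long-established tool for both automated testing and crawling.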

pip install selenium

1.2 Initializing the browser driver

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# Configure Chrome options
options = Options()
options.add_argument('--headless')            # headless mode (no visible window)
options.add_argument('--no-sandbox')          # disable the sandbox
options.add_argument('--disable-gpu')         # disable GPU acceleration
options.add_argument('--window-size=1920,1080')
options.add_argument('--user-agent=Mozilla/5.0...')  # custom User-Agent

# Anti-detection: hide automation markers
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)

# Create the driver (Selenium 4.6+ locates a matching driver via Selenium Manager)
driver = webdriver.Chrome(options=options)
# Or point to a driver binary explicitly
# driver = webdriver.Chrome(service=Service('./chromedriver'), options=options)

1.3 Page navigation

# Open a page
driver.get('https://example.com')

# Current URL
current_url = driver.current_url

# Page title
title = driver.title

# Page source
page_source = driver.page_source

# Back / forward / refresh
driver.back()
driver.forward()
driver.refresh()

# Close the current tab / quit the browser and end the session
driver.close()
driver.quit()

1.4 Locating elements (8 strategies)

from selenium.webdriver.common.by import By

# 1. By ID (most precise when IDs are unique; recommended)
element = driver.find_element(By.ID, 'username')

# 2. By name attribute
element = driver.find_element(By.NAME, 'password')

# 3. By class name
element = driver.find_element(By.CLASS_NAME, 'btn-submit')

# 4. By tag name
element = driver.find_element(By.TAG_NAME, 'input')

# 5. By CSS selector (most flexible; recommended)
element = driver.find_element(By.CSS_SELECTOR, '#login-form > input.email')
element = driver.find_element(By.CSS_SELECTOR, 'div.container > ul > li:first-child')

# 6. By XPath
element = driver.find_element(By.XPATH, '//input[@name="username"]')
element = driver.find_element(By.XPATH, '//div[@class="item"][2]/span')

# 7. By link text (exact match on the link's text)
element = driver.find_element(By.LINK_TEXT, 'View more')

# 8. By partial link text (substring match on the link's text)
element = driver.find_element(By.PARTIAL_LINK_TEXT, 'more')

# Find multiple elements (returns a list; empty if nothing matches)
elements = driver.find_elements(By.CSS_SELECTOR, '.list-item')

1.5 Interacting with elements

# Type text
element.send_keys('hello world')

# Clear an input
element.clear()

# Click
element.click()

# Read attributes
value = element.get_attribute('href')
class_name = element.get_attribute('class')

# Visible text
text = element.text

# Tag name
tag = element.tag_name

# Check whether the element is displayed / enabled / selected
is_displayed = element.is_displayed()
is_enabled = element.is_enabled()
is_selected = element.is_selected()

# Size and position
size = element.size       # {'width': 100, 'height': 50}
location = element.location  # {'x': 200, 'y': 300}

# Screenshots (single element / current viewport)
element.screenshot('element.png')
driver.save_screenshot('viewport.png')  # captures the visible viewport, not the full scrollable page

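1.6 Waiting (critical)

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# === Fixed sleep (not recommended) ===
import time
time.sleep(3)

# === Implicit wait (set once, applies to every element lookup in the session) ===
driver.implicitly_wait(10)  # wait up to 10 seconds

# === Explicit wait (recommended, precise control) ===
# Note: the Selenium documentation advises against mixing implicit and explicit waits.
wait = WebDriverWait(driver, 10)

# Wait until the element is present in the DOM
element = wait.until(EC.presence_of_element_located((By.ID, 'content')))

# Wait until the element is visible
element = wait.until(EC.visibility_of_element_located((By.CLASS_NAME, 'result')))

# Wait until the element is clickable
element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn')))

# Wait until the text appears (this condition returns a boolean, not an element)
text_found = wait.until(EC.text_to_be_present_in_element((By.ID, 'msg'), 'Success'))

# Wait until the element disappears
wait.until(EC.invisibility_of_element((By.CLASS_NAME, 'loading')))

# Other commonly used expected conditions:
# - title_contains / title_is
# - presence_of_all_elements_located
# - alert_is_present
# - frame_to_be_available_and_switch_to_it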
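
To make the explicit waits above less of a black box, here is a minimal standard-library sketch of the poll-until-true loop that an explicit wait performs conceptually. The helper name `wait_until`, its parameters, and the fake condition are illustrative assumptions; this is not Selenium's implementation, although the 0.5 second default mirrors WebDriverWait's default polling interval.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Hypothetical helper that only mirrors the idea behind an explicit wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # Stand-in for "element not found yet": keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage with a fake condition that succeeds on the third poll.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2.0, poll_interval=0.01)
```

The point of the sketch is that an explicit wait returns as soon as the condition holds, whereas `time.sleep(3)` always costs the full three seconds and still fails if the page needs four.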

1.7 Cookie management

# Get all cookies
cookies = driver.get_cookies()

# Get a single cookie
cookie = driver.get_cookie('session_id')

# Add a cookie (the browser must already be on a page of the cookie's domain)
driver.add_cookie({'name': 'token', 'value': 'abc123'})

# Delete cookies
driver.delete_cookie('token')
driver.delete_all_cookies()

1.8 Executing JavaScript

# Run JS code
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')

# Scroll an element into view
element = driver.find_element(By.ID, 'target')
driver.execute_script('arguments[0].scrollIntoView(true);', element)

# Modify an element attribute
driver.execute_script("arguments[0].setAttribute('value', 'new_value')", element)

# Get a return value
result = driver.execute_script('return document.title')
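
A common crawler use of `execute_script` is loading an infinite-scroll list. The sketch below repeats the scroll-to-bottom call shown above until the page height stops growing. The function name, the `pause` and `max_rounds` parameters, and the `FakeDriver` stand-in are assumptions made for illustration; with Selenium you would pass a real `driver` instead.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with an `execute_script(js)` method, such as a
    Selenium WebDriver. Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    loaded_rounds = 0
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was appended, assume the end of the list
        last_height = new_height
        loaded_rounds += 1
    return loaded_rounds


class FakeDriver:
    """Stand-in for a WebDriver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith('return'):
            return self.heights[self.index]
        # A scroll command: advance to the next simulated page height.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


rounds = scroll_until_stable(FakeDriver(), pause=0)
```

The `max_rounds` cap is a deliberate safety limit so that a page which keeps appending content cannot trap the crawler in an endless loop.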

1.9 Window and tab management

# Handle of the current window
current = driver.current_window_handle

# Handles of all windows
all_windows = driver.window_handles

# Switch to another window
driver.switch_to.window(all_windows[1])

# Open a new tab / a new window (focus switches to it)
driver.switch_to.new_window('tab')
driver.switch_to.new_window('window')

# Maximize the window / set its size
driver.maximize_window()
driver.set_window_size(1024, 768)
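
Indexing `window_handles` by position, as in `all_windows[1]`, relies on the handle order, which is easy to get wrong once several tabs are open. A slightly more defensive sketch is to remember the handles you already know and switch to whichever one is new. The helper name and the fake driver are hypothetical; only `window_handles` and `switch_to.window` are taken from the snippets above.

```python
from types import SimpleNamespace


def switch_to_new_window(driver, known_handles):
    """Switch to the first window handle not in `known_handles`.

    Returns the new handle, or None if no new window has opened yet.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None


# Fake driver for demonstration: one known window plus a freshly opened popup.
switched = []
fake_driver = SimpleNamespace(
    window_handles=['main', 'popup'],
    switch_to=SimpleNamespace(window=switched.append),
)
new_handle = switch_to_new_window(fake_driver, known_handles={'main'})
```

With a real driver, capture `set(driver.window_handles)` before clicking the link that opens the new tab, then call the helper afterwards.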

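1.10 Frames and alerts

# Switch into an iframe
driver.switch_to.frame('frame_name_or_id')
driver.switch_to.frame(driver.find_element(By.TAG_NAME, 'iframe'))
# Switch back to the main document
driver.switch_to.default_content()

# Handle an alert dialog (read or type first; the alert is gone once accepted or dismissed)
alert = driver.switch_to.alert
text = alert.text              # read the dialog text
alert.send_keys('some input')  # type into a prompt() dialog
alert.accept()                 # confirm (OK)
# alert.dismiss()              # or cancel instead of accepting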

1.11 ActionChains (mouse and keyboard)

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

actions = ActionChains(driver)

# Hover
actions.move_to_element(element).perform()

# Right-click
actions.context_click(element).perform()

# Double-click
actions.double_click(element).perform()

# Drag and drop (source and target are WebElements located beforehand)
actions.drag_and_drop(source, target).perform()

# Keyboard input
element.send_keys(Keys.ENTER)
element.send_keys(Keys.CONTROL + 'a')    # select all
element.send_keys(Keys.CONTROL + 'c')    # copy

# Key combination via ActionChains
actions.key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform()

1.12 Select dropdowns

from selenium.webdriver.support.select import Select

select_element = driver.find_element(By.ID, 'city')
select = Select(select_element)

select.select_by_value('beijing')        # by value attribute
select.select_by_visible_text('Beijing')  # by visible text
select.select_by_index(1)                # by index

# All options (named all_options to avoid shadowing the Chrome `options` object)
all_options = select.options
# Currently selected option
selected = select.first_selected_option

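2. Playwright

2.1 Overview

Playwright, developed by Microsoft, is a newer-generation browser automation framework. It drives three browser engines (Chromium, Firefox, and WebKit), ships with auto-waiting and network interception, and has a more modern API design. Official bindings exist for Python, Node.js, Java, and .NET.

pip install playwright
playwright install   # downloads the browser binaries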

2.2 Launching a browser

from playwright.sync_api import sync_playwright
# Async version: from playwright.async_api import async_playwright

with sync_playwright() as p:
    # Launch the browser (three variants)
    browser = p.chromium.launch(headless=True)          # headless
    # browser = p.chromium.launch(headless=False)       # headed (visible window)
    # browser = p.chromium.launch(channel='chrome')     # use the locally installed Chrome

    # Create a context (an isolated browser session, similar to an incognito window)
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0...',
        locale='zh-CN',
        timezone_id='Asia/Shanghai',
    )

    # Create a page
    page = context.new_page()
    page.goto('https://example.com')

    # Clean up
    context.close()
    browser.close()

2.3 Page navigation and control

# Navigate
page.goto('https://example.com', wait_until='networkidle')  # wait until the network is idle
# wait_until: 'load' | 'domcontentloaded' | 'networkidle' | 'commit'

# Page information
url = page.url
title = page.title()
content = page.content()

# Back / forward / reload
page.go_back()
page.go_forward()
page.reload()

# Screenshot (full_page=True captures the whole scrollable page)
page.screenshot(path='screenshot.png', full_page=True)

# Generate a PDF (supported in Chromium only)
page.pdf(path='page.pdf')

# Close the page
page.close()

2.4 Locating elements (locators recommended)

# === Locator API (recommended; the core of Playwright) ===

# Text matching
page.locator('text=Log in').click()
page.locator('text=Confirm order').click()         # whitespace is normalized when matching
page.locator('button:has-text("Submit")').click()  # substring match

# CSS selector
page.locator('#username').fill('admin')
page.locator('.form input[name="email"]').fill('test@qq.com')

# XPath
page.locator('//button[@type="submit"]').click()

# Filtering
page.locator('li').filter(has_text='Python').click()
page.locator('li').filter(has=page.locator('.badge')).count()

# By position
page.locator('button').first.click()
page.locator('button').last.click()
page.locator('button').nth(2).click()   # zero-based: the third button

# === Legacy ElementHandle API (discouraged) ===
page.query_selector('#id')
page.query_selector_all('.class')

2.5 Interacting with elements

# Fill text (clears the existing value first)
page.locator('#username').fill('hello world')

# Type character by character (simulates human typing; replaces the deprecated type())
page.locator('#username').press_sequentially('hello', delay=100)

# Clear
page.locator('#username').clear()

# Click (waits automatically until the element is clickable)
page.locator('button').click()
page.locator('button').dblclick()             # double-click
page.locator('button').click(button='right')  # right-click

# Read text
text = page.locator('h1').text_content()
inner_text = page.locator('h1').inner_text()

# Read an attribute
value = page.locator('input').get_attribute('value')

# Read the current input value
input_value = page.locator('input').input_value()

# Checkboxes / radio buttons
page.locator('#agree').check()
page.locator('#agree').uncheck()
page.locator('#agree').is_checked()

# Dropdown selection
page.locator('select').select_option('beijing')        # by value
page.locator('select').select_option(label='Beijing')  # by label
page.locator('select').select_option(index=1)          # by index

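2.6 Auto-waiting (Playwright's biggest advantage)

📌

Playwright waits automatically, so most scripts need no hand-written wait code. Before each action it runs the actionability checks that apply to that action: the element must be visible, stable (not animating), able to receive events (not covered by another element), and enabled; fill() additionally requires the element to be editable.

# Before acting, Playwright waits for these conditions:
# 1. The element is attached to the DOM
# 2. The element is visible
# 3. The element is stable (animations have finished)
# 4. The element receives events (not obscured)
# 5. The element is enabled

# Explicit waits, when needed:
page.wait_for_selector('.result', state='visible')     # wait until visible
page.wait_for_selector('.loading', state='hidden')      # wait until hidden
page.wait_for_selector('.item', state='attached')       # wait until attached to the DOM
page.wait_for_selector('.item', state='detached')       # wait until removed from the DOM

# Wait for a load state
page.wait_for_load_state('networkidle')                 # network idle
page.wait_for_load_state('domcontentloaded')            # DOM content loaded

# Fixed delay (discouraged outside debugging)
page.wait_for_timeout(3000)                             # wait 3 seconds

# Wait until a JS predicate becomes truthy
page.wait_for_function('() => window.scrollY > 1000')

# Wait for a request/response: the Python API wraps the triggering action
# in expect_request / expect_response ('#login' and '#load' are placeholder selectors)
with page.expect_request('**/api/login') as request_info:
    page.locator('#login').click()
request = request_info.value
with page.expect_response('**/api/data') as response_info:
    page.locator('#load').click()
response = response_info.value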

2.7 Network interception and mocking

# Intercept requests
page.route('**/*.{png,jpg,jpeg,gif}', lambda route: route.abort())  # block images

# Modify a request
def handle_route(route):
    headers = {**route.request.headers, 'X-Custom': 'value'}
    route.continue_(headers=headers)
page.route('**/api/**', handle_route)

# Mock a response
page.route('**/api/user', lambda route: route.fulfill(
    status=200,
    content_type='application/json',
    body='{"name":"test","age":25}'
))

# Remove the interception
page.unroute('**/api/**')
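
For crawling, interception is often used to skip resources that are irrelevant to the data being extracted. The sketch below separates the decision (a pure function that is easy to test) from the route handler that applies it. The blocked resource types and URL keywords are illustrative assumptions to adapt to your target site; `request.resource_type` and `request.url` are the Playwright request properties the handler reads.

```python
BLOCKED_RESOURCE_TYPES = {'image', 'media', 'font'}
BLOCKED_URL_KEYWORDS = ('analytics', 'tracking')  # hypothetical examples


def should_block(resource_type, url):
    """Return True when a request is not needed for data extraction."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    return any(keyword in url for keyword in BLOCKED_URL_KEYWORDS)


def handle_route(route):
    """Route handler: abort unwanted requests, let everything else through."""
    request = route.request
    if should_block(request.resource_type, request.url):
        route.abort()
    else:
        route.continue_()


# Wiring it up (requires a Playwright `page`):
# page.route('**/*', handle_route)
```

Filtering on the resource type rather than the file extension also catches images served from URLs without a `.png` or `.jpg` suffix.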

2.8 Listening to network traffic

# Log every request
page.on('request', lambda request: print(f'>> {request.method} {request.url}'))

# Log every response
page.on('response', lambda response: print(f'<< {response.status} {response.url}'))

# Capture the responses of specific requests
responses = []
page.on('response', lambda res: responses.append(res) if 'api' in res.url else None)

page.goto('https://example.com')
# Process the data collected in responses afterwards...

2.9 Cookies and storage

# Cookies
cookies = context.cookies()
context.add_cookies([{'name': 'token', 'value': 'abc', 'url': 'https://example.com'}])
context.clear_cookies()

# localStorage / sessionStorage
page.evaluate('() => localStorage.setItem("key", "value")')
value = page.evaluate('() => localStorage.getItem("key")')
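
When moving a logged-in session from Selenium to Playwright, the cookie dictionaries need a small reshaping: Selenium's `get_cookies()` reports the expiry time under `expiry`, while Playwright's `add_cookies()` expects `expires`, and each cookie needs either a `url` or a `domain` plus `path`. The converter below is a sketch under the assumption that every exported cookie carries a `domain` field; the function name and sample data are made up for illustration.

```python
def selenium_to_playwright_cookies(selenium_cookies):
    """Reshape cookies from Selenium's get_cookies() for Playwright's add_cookies()."""
    converted = []
    for cookie in selenium_cookies:
        item = {
            'name': cookie['name'],
            'value': cookie['value'],
            'domain': cookie['domain'],        # assumed to be present
            'path': cookie.get('path', '/'),
        }
        if 'expiry' in cookie:
            item['expires'] = cookie['expiry']  # key is renamed, value is a Unix timestamp
        for key in ('secure', 'httpOnly'):
            if key in cookie:
                item[key] = cookie[key]
        # Playwright only accepts these three sameSite values.
        if cookie.get('sameSite') in ('Strict', 'Lax', 'None'):
            item['sameSite'] = cookie['sameSite']
        converted.append(item)
    return converted


# Sample data in the shape Selenium returns (values are invented).
sample = [{
    'name': 'token', 'value': 'abc123', 'domain': 'example.com', 'path': '/',
    'expiry': 1893456000, 'secure': True, 'httpOnly': False, 'sameSite': 'Lax',
}]
converted = selenium_to_playwright_cookies(sample)
# With a real context: context.add_cookies(converted)
```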

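2.10 Executing JavaScript

# Evaluate a JS expression
result = page.evaluate('document.title')

# Pass an argument
page.evaluate('(arg) => console.log(arg)', {'name': 'test'})

# Pass an element (element_handle is an ElementHandle obtained earlier, e.g. via page.query_selector)
page.evaluate('(el) => el.textContent', element_handle)

# Evaluate in the context of a located element
element = page.locator('.counter')
text = element.evaluate('el => el.textContent')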

2.11 Multiple pages and tabs

# Catch a page opened by a click
with context.expect_page() as new_page_info:
    page.locator('a[target="_blank"]').click()
new_page = new_page_info.value

# All pages in the context
pages = context.pages

# Open a new tab
new_page = context.new_page()

2.12 Frames

# Get a frame by name or by URL
frame = page.frame(name='frame-name')
frame = page.frame(url='**/iframe.html')

# Operate on elements inside the frame
frame.locator('button').click()

# All frames on the page
frames = page.frames

# Switching back to the main frame
# Not needed: page always addresses the main frame

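2.13 Dialogs

# With no 'dialog' listener registered, Playwright dismisses dialogs automatically.
# To control the outcome, register ONE handler that either accepts or dismisses:
page.on('dialog', lambda dialog: dialog.accept())
# page.on('dialog', lambda dialog: dialog.dismiss())

# Read the dialog message (once any listener exists, some listener must accept or dismiss the dialog, or the page stalls)
page.on('dialog', lambda dialog: print(dialog.message))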

2.14 Mouse and keyboard

# Mouse
page.mouse.move(100, 200)
page.mouse.click(100, 200)
page.mouse.dblclick(100, 200)
page.mouse.down()
page.mouse.up()

# Keyboard
page.keyboard.press('Enter')
page.keyboard.press('Control+A')
page.keyboard.type('hello world', delay=100)

# Drag and drop
page.locator('#source').drag_to(page.locator('#target'))

2.15 Recording and code generation

# Playwright's built-in code generator (records your actions and emits code)
playwright codegen https://example.com

# Choose the output language
playwright codegen --target python https://example.com
playwright codegen --target python-pytest https://example.com

3. Selenium vs. Playwright: Full Comparison

3.1 API quick-reference table

Feature	Selenium	Playwright
Open a page	`driver.get(url)`	`page.goto(url)`
Locate an element	`find_element(By.ID, 'x')`	`page.locator('#x')`
Type text	`element.send_keys('text')`	`page.locator('#x').fill('text')`
Click	`element.click()`	`page.locator('#x').click()`
Wait for an element	`WebDriverWait(driver,10).until(EC...)`	Built-in auto-waiting
Screenshot	`driver.save_screenshot('a.png')`	`page.screenshot(path='a.png',full_page=True)`
Execute JS	`driver.execute_script('...')`	`page.evaluate('...')`
Network interception	Usually via a proxy (e.g. BrowserMob Proxy) or Selenium 4's CDP/BiDi features	`page.route()`, built in
Multiple tabs	`driver.switch_to.window(h)`	`context.pages`, no handle switching
iframe	`driver.switch_to.frame(el)`	`page.frame(name='x')`, no context switch
File upload	`send_keys(path)`	`page.locator('input').set_input_files(path)`
Code generation	Requires a separate tool	`playwright codegen`, built in
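
When porting Selenium code, the locator row of the table can be applied mechanically. The sketch below maps a Selenium `(by, value)` pair to a Playwright selector string. The `by` keys are the string values behind Selenium's `By` constants (for example `By.CLASS_NAME` is `'class name'`). The function name is hypothetical, and the link-text mappings are approximations: Playwright's `:has-text()` is case-insensitive, and the ID and class forms assume values that are valid CSS identifiers.

```python
def to_playwright_selector(by, value):
    """Translate a Selenium (by, value) pair into a Playwright selector string."""
    escaped = value.replace('"', '\\"')  # keep double quotes valid inside quoted selectors
    templates = {
        'id': f'#{value}',
        'name': f'[name="{escaped}"]',
        'class name': f'.{value}',
        'tag name': value,
        'css selector': value,
        'xpath': f'xpath={value}',
        'link text': f'a:text-is("{escaped}")',
        'partial link text': f'a:has-text("{escaped}")',
    }
    if by not in templates:
        raise ValueError(f'unsupported locator strategy: {by!r}')
    return templates[by]


# Example: the Selenium call find_element(By.ID, 'username') becomes
# page.locator(to_playwright_selector('id', 'username'))
selector = to_playwright_selector('id', 'username')
```

The result is a plain string, so it can be passed straight to `page.locator(...)` during a gradual migration.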

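3.2 Key differences

❗

Selenium talks to the browser through the WebDriver protocol: each command is a separate HTTP request to the driver, which adds round-trip overhead. Playwright keeps a persistent connection to the browser (the Chrome DevTools Protocol for Chromium, its own equivalent protocols for Firefox and WebKit), so commands avoid the per-request overhead and are generally faster and more stable.

Aspect	Selenium	Playwright
Protocol	WebDriver (HTTP)	CDP for Chromium, Playwright-specific protocols for Firefox/WebKit (persistent connection)
Auto-waiting	Manual `WebDriverWait`	Built in
Browser support	Chrome / Firefox / Safari / Edge / IE	Chromium / Firefox / WebKit
Language support	Python / Java / C# / JS / Ruby, plus community bindings such as PHP	Python / JS / Java / .NET
Network interception	Limited (often relies on external tools)	Strong (native `page.route()`)
Multiple pages/tabs	Manual handle switching	Managed through `context.pages`
iframe handling	Requires a context switch	Located directly, no switch needed
Record and replay	Separate tool (Selenium IDE)	Built-in `playwright codegen`
Mobile emulation	Limited	Native device emulation
API testing	Not supported	Built-in `request` API (can call HTTP APIs directly)
Debugging tools	Basic logging	Playwright Inspector / Trace Viewer
Speed	Slower (each command = one HTTP request)	Faster (persistent connection)
Stability	Prone to flaky tests without careful waits	Fewer flaky tests thanks to auto-waiting
Learning curve	Moderate	Lower (more modern API)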

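3.3 Strengths and weaknesses

Selenium

Strengths:

Broadest browser compatibility (including IE)
Widest language support (official and community bindings cover six or more languages)
Mature community, abundant learning material, longest track record
Widely used in enterprises, with many existing projects
Distributed execution through Selenium Grid

Weaknesses:

Slower (each command = one HTTP request)
Waits must be managed by hand (WebDriverWait)
Limited network interception
Switching between iframes and windows is cumbersome
Easy to end up with unstable (flaky) test cases

Playwright

Strengths:

Fast (persistent connection, no per-command HTTP round trip)
Built-in auto-waiting, so far fewer flaky failures
Powerful network interception (route / fulfill)
More modern, more concise API
Built-in code generator (playwright codegen)
Native mobile device emulation
Trace Viewer records every step and lets you replay it

Weaknesses:

No IE support
Fewer supported languages (four)
Relatively new, with a smaller community than Selenium
Migrating legacy enterprise projects is costly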

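3.4 Recommendations

Scenario	Recommendation
New crawler or test automation project	Playwright (faster and more stable)
IE support required	Selenium
Existing Selenium project	Keep using it and migrate gradually
Java / C# with a modern API	Playwright (Java and .NET are supported)
Crawler must intercept or modify network requests	Playwright (native route support)
Team unfamiliar with newer tools	Selenium (plentiful resources, quick to start)
Mobile web automation	Playwright (native device emulation)

✅

Summary: for a new project, prefer Playwright: its auto-waiting, network interception, and modern API can noticeably speed up development. If you already have a large Selenium codebase or must support IE, stay with Selenium. The two share the same core concepts, so once you know one, moving to the other is inexpensive.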