Nemo

Posted to the Nemo community, Python board


Python Selenium helper wrapper: anti-anti-scraping + memory management

Published 2022/06/10 16:27

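I've been playing with some scraping projects recently, and they need Selenium.

I wrapped Selenium in a small helper that I will most likely reuse later, so I'm writing it down here.

The wrapper does two things. First, it forces Selenium to run one browser at a time and manages the browser's lifetime, which prevents running out of memory. Second, it masks Selenium's fingerprint features so the browser is harder to detect.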

# coding:utf8
"""
Selenium helper
@author Nemo
@time 2022/05/17 11:46
"""
import threading

from selenium import webdriver

# Project-specific modules from the author's code base
from common.utils import resource_utils
from confs import config

# Global lock: only one browser object may exist at a time, to avoid running out of memory
lock = threading.Lock()

# Path of the script that masks Selenium fingerprint features
stealth_file_path = resource_utils.get_resource_path('/spider/stealth.min.js')


class SeleniumBrowser:
    """
    Browser wrapper
    """

    @staticmethod
    def _get_log_options():
        """
        Build the Chrome options
        """
        option = webdriver.ChromeOptions()
        option.add_argument('--no-sandbox')
        option.add_argument('--window-size=1920,1080')  # browser resolution (Chrome expects width,height)
        option.add_argument("--headless")  # run without a UI
        option.add_argument("--disable-gpu")  # disable the GPU
        option.add_argument('--hide-scrollbars')  # hide scrollbars, for some special pages
        option.add_argument("--disable-extensions")
        option.add_argument("--allow-running-insecure-content")
        option.add_argument("--ignore-certificate-errors")
        option.add_argument("--disable-single-click-autofill")
        option.add_argument("--disable-autofill-keyboard-accessory-view")
        option.add_argument("--disable-full-form-autofill-ios")

        # Mask some Selenium fingerprint features to avoid detection
        option.add_argument('--lang=zh_CN.UTF-8')
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        option.add_argument("--disable-blink-features=AutomationControlled")

        # Legacy (non-W3C) protocol mode; drop this line if your
        # Selenium/ChromeDriver version rejects it
        option.add_experimental_option('w3c', False)
        return option

    def exec_action(self, action, *args):
        """
        Run one browser action: action(driver, *args)
        """
        # Hold the lock for the whole lifetime of the browser
        with lock:
            driver = self._get_browser()
            try:
                # Inject the stealth script so it runs before any page script
                with open(stealth_file_path, encoding='utf-8') as f:
                    js = f.read()
                driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
                    "source": js
                })

                # Run the action
                return action(driver, *args)
            finally:
                # quit() closes every window and ends the driver process
                driver.quit()

    def _get_browser(self):
        """
        Create a browser object. The caller must quit() it when done, as exec_action does.
        """
        # Driver path
        driver_path = config.conf.get_selenium_driver_path()
        chrome_options = self._get_log_options()
        # Instantiate the driver with the configured options
        # (Selenium 3 style; Selenium 4 deprecates executable_path in favor of a Service object)
        browser = webdriver.Chrome(
            executable_path=driver_path,
            options=chrome_options
        )
        return browser
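
To show why holding one module-level lock keeps memory bounded, here is a minimal sketch that swaps the real browser for a hypothetical `FakeDriver` stub (not part of Selenium) and counts how many "browsers" are alive at once. The names `FakeDriver`, `stats`, `slow_action`, and `worker` are made up for this illustration; only the locking pattern mirrors the wrapper above.

```python
import threading
import time

lock = threading.Lock()          # plays the role of the wrapper's global lock
_stats_guard = threading.Lock()  # protects the counters (illustration only)
stats = {"alive": 0, "peak": 0}  # fake browsers open now / most open at once


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; it only tracks its lifetime."""

    def __init__(self):
        with _stats_guard:
            stats["alive"] += 1
            stats["peak"] = max(stats["peak"], stats["alive"])

    def quit(self):
        with _stats_guard:
            stats["alive"] -= 1


def exec_action(action, *args):
    """Create a driver, run the action, and quit, all while holding the lock."""
    with lock:
        driver = FakeDriver()
        try:
            return action(driver, *args)
        finally:
            driver.quit()


def slow_action(driver, n):
    time.sleep(0.01)  # pretend to load a page
    return n * 2


results = []
_results_guard = threading.Lock()


def worker(n):
    value = exec_action(slow_action, n)
    with _results_guard:
        results.append(value)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 2, 4, 6, 8]
print(stats["peak"])    # 1
```

Five threads call `exec_action` concurrently, yet the peak never exceeds one, because each driver is created and quit inside the locked region. The trade-off is throughput: callers queue up behind the lock.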


Usage:

from selenium.webdriver.remote.webdriver import WebDriver


def action(driver: WebDriver, user_id):
    """
    Whatever browser operations you need
    """


browser = SeleniumBrowser()
user_id = 1
browser.exec_action(action, user_id)
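
Since the action in the usage snippet is an empty stub, here is a small sketch of what a concrete action might look like. `fetch_title`, `StubDriver`, and the example URL are assumptions for illustration; `driver.get` and `driver.title` are the regular WebDriver members. The stub only exists so the action can be tried without launching Chrome.

```python
def fetch_title(driver, url):
    """Example action: open a page and return its title."""
    driver.get(url)
    return driver.title


class StubDriver:
    """Hypothetical stand-in for a real driver, used only to exercise the action."""

    def __init__(self):
        self.title = ""
        self.visited = []

    def get(self, url):
        # A real driver would load the page; the stub just records the call.
        self.visited.append(url)
        self.title = "stub page for " + url


stub = StubDriver()
title = fetch_title(stub, "https://example.com")
print(title)

# With the wrapper from this post the call would look like:
#   SeleniumBrowser().exec_action(fetch_title, "https://example.com")
```

Whatever the action returns is passed back by `exec_action`, so the caller gets the result after the browser has already been shut down.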


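The wrapper covers most of the Selenium fingerprint masking, but in testing some features were still detected, so a keyword inside the driver binary needs changing as well:

Patch the driver by replacing every cdc_ with dcd_ to get past the Selenium check (the replacement has the same length, so the rest of the binary is untouched):

Linux: sed -i 's/cdc_/dcd_/g' chromedriver
Windows: perl -pi -e "s/cdc_/dcd_/g" chromedriver.exe
Mac: edit the file manually with vim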
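
As a cross-platform alternative to the commands above, the same replacement can be sketched in Python. This is only an illustration under the assumption that a plain byte-for-byte swap is all that is needed; `patch_bytes` and `patch_driver` are hypothetical helper names, and it is worth keeping a backup copy of the driver before patching it.

```python
from pathlib import Path

OLD = b"cdc_"
NEW = b"dcd_"  # same length as OLD, so offsets inside the binary stay valid


def patch_bytes(data: bytes) -> bytes:
    """Return data with every occurrence of OLD replaced by NEW."""
    return data.replace(OLD, NEW)


def patch_driver(path: str) -> int:
    """Patch the driver file in place; return how many markers were replaced."""
    file = Path(path)
    data = file.read_bytes()
    count = data.count(OLD)
    if count:
        file.write_bytes(patch_bytes(data))
    return count


# Quick check on in-memory bytes, no real driver needed
sample = b"\x00$cdc_abc_\x00cdc_x"
patched = patch_bytes(sample)
print(patched)

# On a real driver (path is an example):
#   patch_driver("chromedriver")
```

Reading and writing in binary mode avoids any newline or encoding translation, which matters because the driver is an executable rather than a text file.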


Here is a snippet of front-end rules for detecting Selenium. If your Selenium session is being detected, try running it to check:

var runBotDetection = function () {
    var documentDetectionKeys = [
        "__webdriver_evaluate",
        "__selenium_evaluate",
        "__webdriver_script_function",
        "__webdriver_script_func",
        "__webdriver_script_fn",
        "__fxdriver_evaluate",
        "__driver_unwrapped",
        "__webdriver_unwrapped",
        "__driver_evaluate",
        "__selenium_unwrapped",
        "__fxdriver_unwrapped",
    ];

    var windowDetectionKeys = [
        "_phantom",
        "__nightmare",
        "_selenium",
        "callPhantom",
        "callSelenium",
        "_Selenium_IDE_Recorder",
    ];

    for (const windowDetectionKey of windowDetectionKeys) {
        if (window[windowDetectionKey]) {
            return true;
        }
    }
    for (const documentDetectionKey of documentDetectionKeys) {
        if (window['document'][documentDetectionKey]) {
            return true;
        }
    }

    for (const documentKey in window['document']) {
        if (documentKey.match(/\$[a-z]dc_/) && window['document'][documentKey]['cache_']) {
            return true;
        }
    }

    if (window['external'] && window['external'].toString() && (window['external'].toString()['indexOf']('Sequentum') != -1)) return true;

    if (window['document']['documentElement']['getAttribute']('selenium')) return true;
    if (window['document']['documentElement']['getAttribute']('webdriver')) return true;
    if (window['document']['documentElement']['getAttribute']('driver')) return true;

    return false;
};
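
To run the detection rules from the Selenium side rather than pasting them into the browser console, one option is to pass the JavaScript source to `execute_script` and return the result. The sketch below assumes the snippet above is available as a string; `run_bot_detection` and `RecordingDriver` are hypothetical names, and the stub only records the script so the helper can be tried without a browser.

```python
def run_bot_detection(driver, detection_js):
    """Define runBotDetection in the page and return its boolean result.

    driver: any object with Selenium's execute_script(script) method.
    detection_js: the JavaScript source of the detection rules, as a string.
    """
    # execute_script runs the text as a function body, so a trailing
    # "return" hands the value back to Python.
    script = detection_js + "\nreturn runBotDetection();"
    return bool(driver.execute_script(script))


class RecordingDriver:
    """Hypothetical stub: records the script and returns a canned answer."""

    def __init__(self, answer):
        self.answer = answer
        self.scripts = []

    def execute_script(self, script):
        self.scripts.append(script)
        return self.answer


stub = RecordingDriver(answer=False)
detected = run_bot_detection(stub, "var runBotDetection = function () { return false; };")
print(detected)

# Inside a real action, after driver.get(url):
#   detected = run_bot_detection(driver, open("bot_detection.js").read())
# where bot_detection.js is an assumed file holding the rules.
```

A result of `True` means at least one of the rules matched, so the masking still leaves a trace on that page.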


How to get stealth.min.js:
After installing Node.js, run the following command; the file is generated automatically in the root directory

  npx extract-stealth-evasions


OK, that's all for now ~

