近期在玩一些爬虫类的东西,其中需要用到Selenium。
稍微简单封装了个Selenium操作工具,后续很可能会用得上,所以这里简单记录下。
这里的封装主要做了两个事情:强制单线程执行Selenium防止内存溢出+浏览器管理,加入Selenium指纹特征屏蔽防止被检测。
# coding:utf8
"""
selenium操作工具
@author Nemo
@time 2022/05/17 11:46
"""
import threading
from selenium import webdriver
from common.utils import resource_utils
from confs import config
# 线程同步锁,防止同时存在多个浏览器对象导致内存溢出
lock = threading.Lock()
# 屏蔽selenium指纹特征脚本路径
stealth_file_path = resource_utils.get_resource_path('/spider/stealth.min.js')
class SeleniumBrowser:
"""
浏览器对象
"""
@staticmethod
def _get_log_options():
"""
获取参数配置
"""
option = webdriver.ChromeOptions()
option.add_argument('--no-sandbox')
option.add_argument('window-size=1920x1080') # 设置浏览器分辨率
option.add_argument("--headless") # 开启无界面模式
option.add_argument("--disable-gpu") # 禁用gpu
option.add_argument('--hide-scrollbars') # 隐藏滚动条,应对一些特殊页面
option.add_argument("--disable-extensions")
option.add_argument("--allow-running-insecure-content")
option.add_argument("--ignore-certificate-errors")
option.add_argument("--disable-single-click-autofill")
option.add_argument("--disable-autofill-keyboard-accessory-view[8]")
option.add_argument("--disable-full-form-autofill-ios")
# 屏蔽一些selenium的指纹特征,防止被检测
option.add_argument('lang=zh_CN.UTF-8')
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
option.add_argument("--disable-blink-features")
option.add_argument("--disable-blink-features=AutomationControlled")
option.add_experimental_option('w3c', False)
return option
def exec_action(self, action, *args):
"""
执行一个浏览器动作
"""
lock.acquire()
try:
driver = self._get_browser()
try:
# 屏蔽一些 selenium 指纹特征,防止被检测
with open(stealth_file_path) as f:
js = f.read()
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
"source": js
})
# 执行操作
return action(driver, *args)
finally:
driver.close()
driver.quit()
finally:
lock.release()
def _get_browser(self):
"""
获取一个浏览器对象,注意此方法必须搭配release_browser方法一起使用
"""
# 驱动路径
driver_path = config.conf.get_selenium_driver_path()
chrome_options = self._get_log_options()
# 实例化带有配置的driver对象
browser = webdriver.Chrome(
executable_path=driver_path,
chrome_options=chrome_options
)
return browser
使用方式:
from selenium import webdriver
def action(driver: Webdriver, user_id):
"""
一些需要做的浏览器操作啥的
"""
browser = SeleniumBrowser()
user_id = 1
browser.exec_action(action, user_id)
这里做了绝大部分屏蔽Selenium指纹检测的操作,不过测试中仍有特征被识别出来,需要调整下驱动里头的关键字:
需调整驱动内容,全局替换cdc_至dcd_以绕过selenium检测:
Linux: sed -i 's/cdc_/dcd_/g' chromedriver
Windows: perl -pi -e ‘s/cdc_/dcd_/g’ chromedriver.exe
Mac: 手动使用vim修改即可
随手贴上一个在前端检测Selenium的规则代码,如果遇到Selenium被检测出来,可以跑一下试试:
runBotDetection = function () {
var documentDetectionKeys = [
"__webdriver_evaluate",
"__selenium_evaluate",
"__webdriver_script_function",
"__webdriver_script_func",
"__webdriver_script_fn",
"__fxdriver_evaluate",
"__driver_unwrapped",
"__webdriver_unwrapped",
"__driver_evaluate",
"__selenium_unwrapped",
"__fxdriver_unwrapped",
];
var windowDetectionKeys = [
"_phantom",
"__nightmare",
"_selenium",
"callPhantom",
"callSelenium",
"_Selenium_IDE_Recorder",
];
for (const windowDetectionKey in windowDetectionKeys) {
const windowDetectionKeyValue = windowDetectionKeys[windowDetectionKey];
if (window[windowDetectionKeyValue]) {
return true;
}
};
for (const documentDetectionKey in documentDetectionKeys) {
const documentDetectionKeyValue = documentDetectionKeys[documentDetectionKey];
if (window['document'][documentDetectionKeyValue]) {
return true;
}
};
for (const documentKey in window['document']) {
if (documentKey.match(/\$[a-z]dc_/) && window['document'][documentKey]['cache_']) {
return true;
}
}
if (window['external'] && window['external'].toString() && (window['external'].toString()['indexOf']('Sequentum') != -1)) return true;
if (window['document']['documentElement']['getAttribute']('selenium')) return true;
if (window['document']['documentElement']['getAttribute']('webdriver')) return true;
if (window['document']['documentElement']['getAttribute']('driver')) return true;
return false;
};
stealth.min.js文件获取方法:
安装nodejs后运行以下命令,自动生成在根目录
npx extract-stealth-evasions
Okk,就先到这里了 ~