When building a Python crawler with Scrapy, you may run into some common errors, such as installation failures, broken dependencies, and configuration mistakes. Below are some frequent causes and how to fix them.
1. Installation errors
Installing Scrapy can fail in a variety of ways. Common errors and fixes:
- Installation fails or hangs: when installing Scrapy with pip, network problems can cause the install to fail or stall for a long time. Try installing from a domestic Chinese mirror such as Tsinghua's PyPI mirror, for example:
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
- Missing dependency libraries: Scrapy's compiled dependencies (lxml and friends) need system development headers to build. Install them with your system package manager, for example on Debian/Ubuntu:
sudo apt-get install libxml2-dev libxslt1-dev python3-dev
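After installing the system headers, you can verify from Python that the compiled dependencies actually built. This is a hypothetical diagnostic sketch, not part of Scrapy itself; lxml wraps libxml2/libxslt, so if it imports cleanly, the headers were found at build time:

```python
import importlib

def has_module(name):
    """Return True if `name` can be imported in the current environment."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# lxml is the binding Scrapy builds against libxml2/libxslt
print("lxml available:", has_module("lxml"))
```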
2. Configuration errors
Scrapy's per-project configuration lives in settings.py, and an incorrect configuration will make the crawler fail at runtime. Common errors and fixes:
- User-Agent blocked: if the site you are crawling has banned the User-Agent you send, requests may fail. You can configure a pool of User-Agents in settings.py and rotate through them, for example:
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0.3 Safari/604.5.6',
]
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in middleware so it does not override the random choice
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Custom middleware that rotates through USER_AGENTS
    'myproxy.middlewares.RandomUserAgentMiddleware': 543,
}
(Note: the old scrapy.contrib.* middleware paths were removed in Scrapy 1.x; use the scrapy.downloadermiddlewares.* paths instead.)
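The settings above reference 'myproxy.middlewares.RandomUserAgentMiddleware' without showing it. A minimal sketch of what such a middleware could look like (the module path and class name are assumptions; it would live in myproxy/middlewares.py):

```python
import random

class RandomUserAgentMiddleware:
    """Sketch of a downloader middleware that picks a random User-Agent
    per request. Path and name assumed from the settings example."""

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the crawler, whose settings expose
        # the USER_AGENTS list defined in settings.py
        return cls(crawler.settings.getlist('USER_AGENTS'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random pick for each request
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # let the request continue through the middleware chain
```

Returning None from process_request tells Scrapy to keep passing the request down the chain, which is what we want here.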
- Module import errors: inside a Scrapy Spider, importing your own modules may fail. Check that the module path and name are correct and that you run scrapy from the project root, e.g.:
from myspider.items import MyspiderItem
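When an import like this fails with ModuleNotFoundError, the usual cause is that the project root is not on Python's search path (typically because scrapy was not launched from the directory containing scrapy.cfg). A small diagnostic sketch; the module name 'myspider.items' is taken from the example above:

```python
import importlib
import sys

def diagnose_import(module_name):
    """Try to import module_name; return None on success, or the list of
    sys.path entries that were searched when the import failed."""
    try:
        importlib.import_module(module_name)
        return None
    except ModuleNotFoundError:
        return list(sys.path)

# If this returns a list, 'myspider' is not importable from here; check that
# the current working directory is the project root.
searched = diagnose_import('myspider.items')
```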
Examples
- Installation failure:
$ pip install scrapy
Collecting scrapy
Downloading https://files.pythonhosted.org/packages/32/b1/1aebbbb596bd15d4ff69da8aa18986a5d0b18f1eac8239e5ef87f185d2cc/Scrapy-2.0.1-py2.py3-none-any.whl (248kB)
100% |████████████████████████████████| 256kB 587kB/s
ERROR: Command errored out with exit status 1:
command: /usr/bin/python3.6 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-x5glm7eg/psycopg2/setup.py'"'"'; __file__='"'"'/tmp/pip-install-x5glm7eg/psycopg2/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' build_ext -I/usr/local/pgsql/include
cwd: /tmp/pip-install-x5glm7eg/psycopg2/
Complete output (22 lines):
running build_ext
building 'psycopg2._psycopg' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/psycopg
gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DPSYCOPG_DEFAULT_PYDATETIME=1 -DPSYCOPG_VERSION=2.8.5 (dt dec pq3 ext lo64) -DPG_VERSION_NUM=120005 -DHAVE_LO64=1 -I/usr/local/pgsql/include -I/usr/local/include/python3.6m -I/home/jx/demo/env/include/python3.6m -c psycopg/psycopgmodule.c -o build/temp.linux-x86_64-3.6/psycopg/psycopgmodule.o -Wdeclaration-after-statement
Fix: if the failure is a network or download problem, install from the Tsinghua mirror:
$ pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy
(Note that the log above actually shows a compile failure while building psycopg2; build errors like this are fixed by installing the missing development headers, e.g. sudo apt-get install libpq-dev python3-dev, not by switching mirrors.)
- Configuration error:
# Activate the middlewares below and assign each a priority number; the smaller the number, the earlier the middleware processes each request
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.middleware.SplashCookiesMiddleware': 723,
'scrapy_splash.middleware.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
'myproject.middlewares.RandomUserAgentMiddleware': 543,
}
# Splash configuration
SPLASH_URL = 'http://127.0.0.1:8050'
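The numeric values in DOWNLOADER_MIDDLEWARES decide execution order: sorting the dict by value reproduces the order in which process_request will be called (lower numbers run first on the request path). A quick sanity-check sketch using the same dict as the settings example above:

```python
# Same dict as in the settings example above
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.middleware.SplashCookiesMiddleware': 723,
    'scrapy_splash.middleware.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'myproject.middlewares.RandomUserAgentMiddleware': 543,
}

# Lower number = processed earlier on the request path
request_order = [name for name, _ in sorted(DOWNLOADER_MIDDLEWARES.items(),
                                            key=lambda kv: kv[1])]
```

Here the custom RandomUserAgentMiddleware (543) sees each request before the splash middlewares (723, 725) do.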
Fix: review the configuration, make sure all dependency libraries are correctly installed, and check that variable and parameter names are spelled correctly.
Unless otherwise noted, articles on this site are original; when reprinting, please cite the source: How to fix Scrapy setup errors in a Python crawler - Python技术站