- Published on
Scrapy
- Authors
- Name
- Lucas Xu
- @xianminx
Scrapy Learned
Architecture
engine
spider
scheduler
downloader
item pipelines
Terminology
crawl: follow links from page to page
parser: parses a page's HTML into structured data
spider: defines how the crawling and parsing are done
scraper: extracts data from within a web page
A crawler
gets web pages: given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore), it downloads whatever is linked to from the starting point(s).
A scraper
takes pages that have been downloaded (or, more generally, data that is formatted for display) and attempts to extract data from them, so that it can, for example, be stored in a database and manipulated as desired.
parse(self, response)
returns (or yields) any of:
a dict of extracted data
an Item
a Request whose callback is parse() or another method
Selectors
syntax: response.css(selector)
or response.xpath(selector)
text of an element:
a::text
child level:
div.pre_post > a
class:
div.pre_post
id:
div#id
functions:
extract_first()
CSS selection results are a list. extract_first() returns the first element of the results, which is convenient even when there is only one match, and returns None when nothing matches.
Scrapy Commands
scrapyd-deploy
scrapy.cfg
Global commands
startproject
genspider
settings
runspider
shell
fetch
view
version
Project commands
crawl
check
list
edit
parse
bench
project structure
scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
Scrapy Shell
TODOS:
response.follow
yield result or page
Spider
Selector
The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too.
.xpath('//title/text()')
.css('title::text')
.re('a*')
In [20]: type(response.selector)
Out[20]: scrapy.selector.unified.Selector
In [21]: type(response.selector.xpath('//title/text()'))
Out[21]: scrapy.selector.unified.SelectorList
In [22]: type(response.selector.xpath('//title/text()')[0])
Out[22]: scrapy.selector.unified.Selector
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
response.css('img').xpath('@src').extract()
response.css('img::attr(src)').extract()
extract()
extract_first()
Relative XPath
.// : relative to the current selector
// : absolute, from the document root
Item
ItemLoader
Scrapy Shell
Item pipeline
Feed Exports
Serialization formats (item exporters)
JSON
JSON lines
CSV
XML
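To make the difference between these formats concrete, here is a sketch that serializes two made-up items with the standard library only. It does not use Scrapy's exporters; it just shows what the shape of each output looks like.

```python
import csv
import io
import json

# Made-up items standing in for scraped data.
items = [
    {"title": "First", "url": "http://example.com/p/1"},
    {"title": "Second", "url": "http://example.com/p/2"},
]

# JSON: one array holding every item.
as_json = json.dumps(items)

# JSON lines: one independent JSON object per line, easy to append and stream.
as_jsonlines = "\n".join(json.dumps(item) for item in items)

# CSV: a header row followed by one row per item.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "url"])
writer.writeheader()
writer.writerows(items)
as_csv = buffer.getvalue()
```

JSON lines is usually the safer choice for large crawls, since each line can be parsed on its own without loading the whole file.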
Requests and Responses
Link Extractors
Downloader Middleware
In settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
hook
engine -> mw1 -> mw2 -> mw3 -> downloader
engine
↓
mw1.process_request()
↓
mw2.process_request()
↓
mw3.process_request()
↓
↓
downloader
↓
↓
mw3.process_response()
↓
mw2.process_response()
↓
mw1.process_response()
↓
engine
process_request(request, spider)
process_response(request, response, spider)
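The call order in the diagram can be reproduced with a toy driver. The middleware class below uses the two hook signatures from these notes, but `TracingMiddleware` and `run_chain` are hypothetical names, and `run_chain` is only a simplified stand-in for the engine and downloader, not Scrapy's actual implementation.

```python
class TracingMiddleware:
    """Hypothetical middleware that records when each hook runs."""

    def __init__(self, label, trace):
        self.label = label
        self.trace = trace

    def process_request(self, request, spider):
        self.trace.append(f"{self.label}.process_request")
        return None  # None lets the request continue toward the downloader

    def process_response(self, request, response, spider):
        self.trace.append(f"{self.label}.process_response")
        return response  # hand the response on toward the engine


def run_chain(middlewares, request, spider=None):
    """Toy stand-in for the engine/downloader round trip (not Scrapy code)."""
    for mw in middlewares:  # engine -> downloader
        mw.process_request(request, spider)
    response = f"response for {request}"  # pretend download
    for mw in reversed(middlewares):  # downloader -> engine
        response = mw.process_response(request, response, spider)
    return response


trace = []
chain = [TracingMiddleware(f"mw{i}", trace) for i in (1, 2, 3)]
result = run_chain(chain, "http://example.com/")
```

After the run, `trace` lists the process_request hooks in the order mw1, mw2, mw3 and the process_response hooks in the reverse order, matching the diagram.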
Sitemap
curl http://www.jollychic.com/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_images_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_images_9.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_product_2.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
<sitemap>
<loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
<lastmod>2017-05-08</lastmod>
</sitemap>
</sitemapindex>
curl http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz | gzip -d | less
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
<url>
<priority>0.5</priority>
</url>
<url>
<loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
<priority>0.5</priority>
</url>
</urlset>
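Both document types above keep their URLs in `<loc>` elements, so one helper can read either. This is a sketch using only the standard library; `sitemap_locs` is a hypothetical helper name, and the sample data is a trimmed copy of the output above rather than a live download.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_bytes):
    """Return every <loc> URL in a sitemap index or urlset document."""
    # Sitemap files are often served gzip-compressed (.xml.gz).
    if xml_bytes[:2] == b"\x1f\x8b":
        xml_bytes = gzip.decompress(xml_bytes)
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Trimmed samples based on the curl output above.
index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
    <lastmod>2017-05-08</lastmod>
  </sitemap>
</sitemapindex>
"""
urlset_xml_gz = gzip.compress(b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
    <priority>0.5</priority>
  </url>
</urlset>
""")

index_links = sitemap_locs(index_xml)
page_links = sitemap_locs(urlset_xml_gz)
```

The links from the index point at further sitemap files, while the links from a urlset are the pages themselves, so a crawler would feed the first list back into `sitemap_locs` and the second into its parser.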
for index, link in enumerate(links):