Scrapy

Scrapy Learned

Architecture

  • engine

  • spider

  • scheduler

  • downloader

  • item pipelines

Terminology

  • crawl: follow links from page to page

  • parser: parses a page's HTML into structured data

  • spider: defines how the crawling and the parsing are done

  • scraper: extracts data from within a web page

A crawler gets web pages -- i.e., given a starting address (or set of starting addresses) and some conditions (e.g., how many links deep to go, types of files to ignore) it downloads whatever is linked to from the starting point(s).

A scraper takes pages that have been downloaded (or, more generally, data that is formatted for display) and attempts to extract data from them, so that it can, for example, be stored in a database and manipulated as desired.

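parse(self, response)

parse() can return (or yield) any of the following:

  • a dict of extracted data

  • an Item

  • a Request whose callback (for example parse() itself) handles the next response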

Selectors

syntax: response.css(selector) or response.xpath(selector)

  • text content: a::text

  • child level: div.pre_post > a

  • class: div.pre_post

  • id: div#id

functions:

  • extract_first(): a CSS selection always produces a list of results, even when only one element matches; extract_first() returns the first result as a string, or None if nothing matched.

Scrapy Commands

  • scrapyd-deploy

  • scrapy.cfg

Global commands

  • startproject

  • genspider

  • settings

  • runspider

  • shell

  • fetch

  • view

  • version

Project commands

  • crawl

  • check

  • list

  • edit

  • parse

  • bench

Project structure

scrapy.cfg
myproject/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

Scrapy Shell

TODOS:

  • response.follow

  • yield result or page

Spider

Selector

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods for those selectors too.

  • .xpath('//title/text()')

  • .css('title::text')

  • .re('a*')

In [20]: type(response.selector)
Out[20]: scrapy.selector.unified.Selector

In [21]: type(response.selector.xpath('//title/text()'))
Out[21]: scrapy.selector.unified.SelectorList

In [22]: type(response.selector.xpath('//title/text()')[0])
Out[22]: scrapy.selector.unified.Selector

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
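response.css('img').xpath('@src').extract()
response.css('img::attr(src)').extract()

  • extract(): returns every match as a list of strings

  • extract_first(): returns the first match as a string, or None if nothing matched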

Relative Xpath

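  • .// relative: searches only inside the node of the current selector

  • // absolute: searches the whole document, even when called on a nested selector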

Item

ItemLoader

Scrapy Shell

Item pipeline

Feed Exports

Serialization formats (item exporters)

  • JSON

  • JSON lines

  • CSV

  • XML
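One way to select these formats is the FEEDS setting, sketched below. This assumes a Scrapy version recent enough to support FEEDS (2.1 or later, as far as I know); the output file names are made up.

```python
# settings.py (fragment)
# Each key is an output location, each value picks the serialization format.
FEEDS = {
    "items.json": {"format": "json"},
    "items.jl": {"format": "jsonlines"},
    "items.csv": {"format": "csv"},
    "items.xml": {"format": "xml"},
}
```

For a quick one-off export, passing an output file on the command line (scrapy crawl with the -o option) should do the same job without touching settings.py.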

Requests and Responses

Downloader Middleware

In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

Hooks

Engine -> mw1 -> mw2 -> mw3 -> downloader

       engine
mw1.process_request()
mw2.process_request()
mw3.process_request()
      downloader
mw3.process_response()
mw2.process_response()
mw1.process_response()
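       engine

process_request(request, spider): called for each request on its way from the engine to the downloader

process_response(request, response, spider): called for each response on its way back from the downloader to the engine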
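The call order in the diagram can be reproduced with a toy simulation. This is not Scrapy's real engine code: `LoggingMiddleware` and `simulate` are hypothetical names, and the "response" is just a string. In a real project the order comes from the numbers in DOWNLOADER_MIDDLEWARES, with lower numbers sitting closer to the engine.

```python
class LoggingMiddleware:
    """Hypothetical middleware that records the order in which its hooks run."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request, spider):
        self.log.append(f"{self.name}.process_request")
        return None  # None means: keep passing the request along

    def process_response(self, request, response, spider):
        self.log.append(f"{self.name}.process_response")
        return response  # hand the response to the next middleware


def simulate(middlewares, request):
    """Toy stand-in for the engine/downloader round trip."""
    # Requests travel from the engine towards the downloader.
    for mw in middlewares:
        mw.process_request(request, spider=None)
    response = f"response for {request}"  # pretend download
    # Responses travel back in the opposite order.
    for mw in reversed(middlewares):
        response = mw.process_response(request, response, spider=None)
    return response


log = []
chain = [LoggingMiddleware(name, log) for name in ("mw1", "mw2", "mw3")]
result = simulate(chain, "http://example.com")
print(log)
# ['mw1.process_request', 'mw2.process_request', 'mw3.process_request',
#  'mw3.process_response', 'mw2.process_response', 'mw1.process_response']
```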

Sitemap

curl http://www.jollychic.com/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="gss.xsl"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_images_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_images_9.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_product_2.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
</sitemapindex>
curl http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz | gzip -d | less
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
    <url>
        <priority>0.5</priority>
    </url>
    <url>
        <loc>http://www.jollychic.com/t/long-white-dresses-t5.html</loc>
        <priority>0.5</priority>
    </url>
</urlset>
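Both files share the same namespace and keep their URLs in `<loc>` elements, so a single helper can read either one. The sketch below uses only the standard library to show the structure; `sitemap_locs` is a hypothetical helper name, and the sample document is a shortened copy of the index above. Scrapy also ships a SitemapSpider that can follow sitemap indexes for you.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by both <sitemapindex> and <urlset> documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(data: bytes) -> list:
    """Return every <loc> URL found in a sitemap index or urlset."""
    # .xml.gz files start with the gzip magic bytes; decompress them first.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS) if loc.text]


index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_product_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://www.jollychic.com/sitemap/sitemap_tag_1.xml.gz</loc>
        <lastmod>2017-05-08</lastmod>
    </sitemap>
</sitemapindex>"""

links = sitemap_locs(index_xml)
```

Entries without a `<loc>` element are simply skipped by this helper.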
for index, link in enumerate(links):