
Scrape Crawler

Requests, BeautifulSoup, Scrapy, Selenium, CAPTCHA

HTTP Methods: GET, POST, PATCH, PUT, DELETE
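
These map directly onto the top-level helpers in requests. A minimal sketch against the public httpbin.org echo service (the URL and payloads are only for illustration):

import requests

# httpbin.org echoes each request back, which makes it handy for trying every verb.
base = 'https://httpbin.org'
print(requests.get(base + '/get', params={'q': 'test'}).status_code)
print(requests.post(base + '/post', data={'name': 'abc'}).status_code)
print(requests.patch(base + '/patch', json={'name': 'abc'}).status_code)
print(requests.put(base + '/put', json={'name': 'abc'}).status_code)
print(requests.delete(base + '/delete').status_code)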

BeautifulSoup URL-editing example
Reddit Scraper
Scrapy Reddit
New York Transportation Data
pyspider: web crawler

Python Web Crawling: From 0 to 1 (19.5 h)

pycurl

1. celery
2. Split the implementation into separate crawler and parser stages. Bind each crawler to a different IP, and allow each IP at most one request per site per minute, managed with a task queue. The crawler does no parsing: fetched pages are pushed into a queue such as Kafka, parsed on other machines, and the results are stored in the database (a sketch follows).
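
A minimal sketch of that crawler/parser split, assuming kafka-python, a local broker, and a topic named raw_pages (the broker address and topic name are illustrative; the per-IP rate limiting and database write are left as comments):

import requests
from kafka import KafkaProducer, KafkaConsumer

TOPIC = 'raw_pages'  # illustrative topic name

# Crawler side: fetch only, no parsing; push the raw page into the queue.
# Rate limiting (one request per site per minute per IP) would be enforced
# by the task queue that hands URLs to this function.
def crawl(urls):
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    for url in urls:
        resp = requests.get(url, timeout=10)
        producer.send(TOPIC, key=url.encode(), value=resp.content)
    producer.flush()

# Parser side, run on other machines: consume raw pages, parse, store.
def parse_loop():
    consumer = KafkaConsumer(TOPIC, bootstrap_servers='localhost:9092')
    for msg in consumer:
        url, html = msg.key.decode(), msg.value.decode('utf8', 'replace')
        # ... parse html (e.g. with BeautifulSoup) and write results to the database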

Datasets with APIs

IMDB Metacritic Movie Rating

Travel Fare

requests may raise errors caused by the SSL/TLS version.
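
One common workaround (a sketch, not from the original page) is to pin the TLS version with a custom HTTPAdapter; the exact ssl_context plumbing can vary with your urllib3 version:

import ssl
import requests
from requests.adapters import HTTPAdapter

class TLS12Adapter(HTTPAdapter):
    # Force TLS 1.2 or newer on the connection pool.
    def init_poolmanager(self, *args, **kwargs):
        ctx = ssl.create_default_context()
        ctx.minimum_version = ssl.TLSVersion.TLSv1_2
        kwargs['ssl_context'] = ctx
        super().init_poolmanager(*args, **kwargs)

session = requests.Session()
session.mount('https://', TLS12Adapter())
# session.get('https://example.com')  # placeholder URL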

import requests

OECD_ROOT_URL = 'http://stats.oecd.org/sdmx-json/data'

def make_OECD_request(dsname, dimensions, params=None, root_dir=OECD_ROOT_URL):
    if not params:
        params = {}
    # Codes within one dimension are joined by '+', dimensions by '.'
    dim_args = ['+'.join(d) for d in dimensions]
    dim_str = '.'.join(dim_args)
    url = root_dir + '/' + dsname + '/' + dim_str + '/all'
    print('Requesting URL: ' + url)
    return requests.get(url, params=params)

response = make_OECD_request(
    'QNA',
    (('USA', 'AUS'), ('GDP', 'B1_GE'), ('CUR', 'VOBARSA'), ('Q',)),  # ('Q',) is a one-element tuple; ('Q') is just a string
    {'startTime': '2009-Q1', 'endTime': '2010-Q1'})

if response.status_code == 200:
    data = response.json()  # avoid shadowing the json module
    print(data.keys())

BeautifulSoup

On Windows, read text files with open('file.txt', encoding='utf8'); otherwise you may hit UnicodeDecodeError: 'cp950' (cp950 is the default locale codec on Traditional Chinese Windows).
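
For example, feeding such a file to BeautifulSoup (a minimal sketch; page.html is a placeholder filename):

from bs4 import BeautifulSoup

# Without encoding='utf8', Windows falls back to the locale codec
# (cp950 on Traditional Chinese systems) and may raise UnicodeDecodeError.
with open('page.html', encoding='utf8') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup.title)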

img src replace
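
The line above is just a link title; a minimal bs4 sketch of the idea, rewriting relative img src values to absolute URLs (example.com and the markup are made up):

from bs4 import BeautifulSoup

html = '<p><img src="/static/a.png"><img src="/static/b.png"></p>'
soup = BeautifulSoup(html, 'html.parser')
for img in soup.find_all('img'):
    img['src'] = 'https://example.com' + img['src']  # replace each src in place
print(soup)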

Price sorting
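
Again just a link title; one way to sort scraped items by price with bs4 (the markup and class names are invented for the sketch):

from bs4 import BeautifulSoup

html = '''<ul>
  <li><span class="name">A</span> <span class="price">$1,200</span></li>
  <li><span class="name">B</span> <span class="price">$899</span></li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')

def price(li):
    # strip '$' and thousands separators before converting to a number
    return float(li.select_one('.price').text.lstrip('$').replace(',', ''))

for li in sorted(soup.select('li'), key=price):
    print(li.select_one('.name').text, price(li))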

Selenium

Controlling the Web with Python Selenium

Selenium for Chrome Extension XPath
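
A minimal Selenium 4 sketch with an XPath selector, assuming a working Chrome/chromedriver install; quotes.toscrape.com is a public scraping practice site:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires a matching chromedriver
try:
    driver.get('https://quotes.toscrape.com/')
    # XPath: each quote on the page sits in a span with class "text"
    for el in driver.find_elements(By.XPATH, '//span[@class="text"]'):
        print(el.text)
finally:
    driver.quit()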

By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free (a minimal sketch follows below).

Tor Webdriver
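
A selectolax sketch of the same kind of extraction bs4 is typically used for; the HTML here is made up:

from selectolax.parser import HTMLParser

html = '<div><a href="/a">A</a><a href="/b">B</a></div>'
tree = HTMLParser(html)
for a in tree.css('a'):  # CSS selectors instead of bs4's find_all
    print(a.attributes.get('href'), a.text())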

bloomberg

Scrapy

scrapyrt
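
A minimal Scrapy spider, following the standard tutorial pattern and targeting the public practice site quotes.toscrape.com; run it with: scrapy runspider quotes_spider.py -o quotes.json

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # follow pagination until there is no "next" link
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)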

web scraping using Node.js
Automating Scraping with JavaScript
chrome puppeteer Node.js

urllib

urllib.urlencode vs urllib.quote (in Python 3: urllib.parse.urlencode vs urllib.parse.quote)
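
The difference in one sketch: urlencode builds a whole query string from key/value pairs (spaces become '+'), while quote percent-escapes a single string for use in a URL path (spaces become '%20', and '/' is kept by default):

from urllib.parse import urlencode, quote

print(urlencode({'q': 'web scraping', 'page': 1}))  # q=web+scraping&page=1
print(quote('web scraping/中文'))  # web%20scraping/%E4%B8%AD%E6%96%87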

captcha

Anti-scraping measures