Scrape Crawler

Requests, BeautifulSoup, Scrapy, Selenium, CAPTCHA

HTTP Methods: GET POST PATCH PUT DELETE
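
As a small illustration of the methods listed above, the sketch below builds (but does not send) one request per method with the standard library's urllib.request. The URL and the JSON payload are placeholders chosen for this example, not endpoints from these notes.

```python
import json
import urllib.request

# Placeholder endpoint and payload, chosen only for illustration.
URL = 'https://example.com/items/1'
PAYLOAD = json.dumps({'name': 'demo'}).encode('utf-8')

def build_request(method, body=None):
    """Build an unsent urllib request for the given HTTP method."""
    headers = {'Content-Type': 'application/json'} if body is not None else {}
    return urllib.request.Request(URL, data=body, headers=headers, method=method)

requests_by_method = {
    'GET': build_request('GET'),               # read a resource
    'POST': build_request('POST', PAYLOAD),    # create a resource
    'PATCH': build_request('PATCH', PAYLOAD),  # partially update a resource
    'PUT': build_request('PUT', PAYLOAD),      # replace a resource
    'DELETE': build_request('DELETE'),         # remove a resource
}

for name, req in requests_by_method.items():
    print(name, req.get_method(), req.full_url)
```

Passing any of these objects to urllib.request.urlopen would actually send it; the Requests library exposes the same methods as requests.get, requests.post, and so on.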

BeautifulSoup URL-editing example, Reddit Scraper, Scrapy Reddit, New York Transportation Data, pyspider: web crawler

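Setting up the development environment: Anaconda + Selenium + Scrapy
Basic GET and POST operations with Requests
Working with APIs and JSON data (example)
Automated processing of protein sequences
Storing data with CSV or SQL
Logging in and authenticating with cookies and a proxy pool: multithread-proxy, 站大爺
Filtering data with regular expressions
The Scrapy crawler framework and log files (github, video)
Pagination in Scrapy
Distributed crawling with Scrapy-Redis
Automated data analysis with Selenium (github, video)
Finding elements by CSS and XPath in Selenium
RoboBrowser (github, video)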
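
Two of the topics in the list above, filtering data with regular expressions and storing it as CSV or SQL, can be sketched with the standard library alone. The HTML fragment, the pattern, and the table name items are made up for this example; a real page would need its own pattern or an HTML parser.

```python
import csv
import io
import re
import sqlite3

# Made-up HTML fragment standing in for a downloaded page.
HTML = '''
<li class="item">Widget A <span class="price">$12.50</span></li>
<li class="item">Widget B <span class="price">$7.00</span></li>
'''

# Regular expression capturing the item name and the numeric price.
ITEM_RE = re.compile(r'<li class="item">(.+?) <span class="price">\$([\d.]+)</span></li>')
rows = [(name, float(price)) for name, price in ITEM_RE.findall(HTML)]

# CSV: written to an in-memory buffer here; open('items.csv', 'w', newline='') works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['name', 'price'])
writer.writerows(rows)

# SQL: an in-memory SQLite database keeps the sketch free of side effects.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (name TEXT, price REAL)')
conn.executemany('INSERT INTO items VALUES (?, ?)', rows)
conn.commit()
stored = conn.execute('SELECT name, price FROM items ORDER BY price').fetchall()
print(stored)
```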

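pycurl
1. celery
2. Implement the work in two stages, crawler and parser. Each crawler binds to a different IP, and each IP is allowed one access per website per minute, managed through a task queue. The crawler does no parsing: fetched pages are pushed into a queue such as Kafka, parsed on other machines, and then stored in the database.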
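
The two-stage design described above can be sketched as follows. This is a single-process illustration under stated assumptions: queue.Queue stands in for Kafka, fake_fetch is a hypothetical stub instead of a real download bound to a source IP, and the RateLimiter class is an invented helper, not part of celery or pycurl.

```python
import queue
import time

MIN_INTERVAL = 60.0  # seconds between hits from one IP to one site

class RateLimiter:
    """Tracks the last access time for every (ip, site) pair."""

    def __init__(self, min_interval=MIN_INTERVAL, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last_hit = {}

    def try_acquire(self, ip, site):
        """Return True and record the hit if (ip, site) is allowed right now."""
        now = self.clock()
        last = self.last_hit.get((ip, site))
        if last is not None and now - last < self.min_interval:
            return False
        self.last_hit[(ip, site)] = now
        return True

def fake_fetch(ip, url):
    """Hypothetical stand-in for a real download bound to a source IP."""
    return '<title>page at %s via %s</title>' % (url, ip)

def crawl(limiter, ip, site, url, page_queue, fetch=fake_fetch):
    """Crawler stage: fetch only, then hand the raw page to the queue."""
    if not limiter.try_acquire(ip, site):
        return False  # a task queue would reschedule this task later
    page_queue.put((url, fetch(ip, url)))
    return True

def parse_all(page_queue):
    """Parser stage: drain the queue and extract fields (here, the title)."""
    results = []
    while not page_queue.empty():
        url, html = page_queue.get()
        title = html.split('<title>')[1].split('</title>')[0]
        results.append({'url': url, 'title': title})
    return results

pages = queue.Queue()  # stand-in for Kafka or another message queue
limiter = RateLimiter()
first = crawl(limiter, '10.0.0.1', 'example.com', 'https://example.com/a', pages)
second = crawl(limiter, '10.0.0.1', 'example.com', 'https://example.com/b', pages)  # too soon
third = crawl(limiter, '10.0.0.2', 'example.com', 'https://example.com/b', pages)   # other IP
records = parse_all(pages)
print(first, second, third, records)
```

Keeping the parser behind a queue means a parsing bug never costs a rate-limited request: the raw page is already stored and can be parsed again.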

DataSet with API

IMDB Metacritic Movie Rating

Travel Fare

requests may raise errors caused by the SSL version in use.
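
When such an error appears, one reasonable first step is to check which OpenSSL build the interpreter is linked against and which TLS versions it supports, since requests typically reaches TLS through Python's ssl module. A minimal diagnostic sketch using only the standard library (it diagnoses, it does not fix anything):

```python
import ssl

# OpenSSL build that this Python interpreter is linked against.
print('OpenSSL:', ssl.OPENSSL_VERSION)

# Whether the linked OpenSSL supports the modern TLS versions.
print('TLS 1.2 supported:', ssl.HAS_TLSv1_2)
print('TLS 1.3 supported:', ssl.HAS_TLSv1_3)

# The default client context shows the minimum protocol version it will accept.
context = ssl.create_default_context()
print('Minimum TLS version:', context.minimum_version.name)
```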

import requests

OECD_ROOT_URL = 'http://stats.oecd.org/sdmx-json/data'

def make_OECD_request(dsname, dimensions, params=None, root_dir=OECD_ROOT_URL):
  # Values within one dimension are joined with '+', dimensions with '.'
  if not params:
    params = {}
  dim_args = ['+'.join(d) for d in dimensions]
  dim_str = '.'.join(dim_args)
  url = root_dir + '/' + dsname + '/' + dim_str + '/all'
  print('Requesting URL: ' + url)
  return requests.get(url, params=params)

# Each dimension is a tuple; ('Q',) needs the trailing comma to be a one-element tuple.
response = make_OECD_request('QNA',
  (('USA', 'AUS'), ('GDP', 'B1_GE'), ('CUR', 'VOBARSA'), ('Q',)),
  {'startTime': '2009-Q1', 'endTime': '2010-Q1'})

if response.status_code == 200:
  data = response.json()  # named data to avoid shadowing the json module
  print(data.keys())

BeautifulSoup

On Windows, read text files with open('file.txt', encoding='utf8'); the explicit encoding is what avoids the UnicodeDecodeError raised by the 'cp950' codec.
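
The effect of the explicit encoding can be reproduced on any platform. The sketch below writes a UTF-8 file to a temporary directory (the path is generated for the example), reads it back with encoding='utf8', and then decodes the same bytes as cp950, which is typically the default on a Traditional Chinese Windows installation.

```python
import os
import tempfile

# Write a small UTF-8 file containing non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with open(path, 'w', encoding='utf8') as f:
    f.write('爬蟲 scraper\n')

# Reading with an explicit encoding gives the same result on every platform.
with open(path, encoding='utf8') as f:
    text = f.read()
print(text)

# Decoding the same bytes as cp950 either fails or yields garbled text.
with open(path, 'rb') as f:
    raw = f.read()
try:
    print(raw.decode('cp950'))
except UnicodeDecodeError as exc:
    print('cp950 failed:', exc)
```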

img src replace

Sorting by price
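
Scraped prices usually arrive as strings with currency symbols and thousands separators, so they need to be converted before sorting. A minimal sketch, with made-up records and a hypothetical parse_price helper:

```python
import re

# Made-up scraped records; the price strings mimic typical page text.
items = [
    {'name': 'A', 'price': 'NT$1,280'},
    {'name': 'B', 'price': '$ 399'},
    {'name': 'C', 'price': '2,050 元'},
]

def parse_price(text):
    """Strip currency symbols and thousands separators, return a float."""
    digits = re.sub(r'[^\d.]', '', text)
    return float(digits)

cheapest_first = sorted(items, key=lambda item: parse_price(item['price']))
print([item['name'] for item in cheapest_first])
```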

Selenium

Controlling the Web with Python Selenium

Selenium for Chrome Extension XPath

By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free

Tor Webdriver

bloomberg

Scrapy

scrapyrt

Web scraping using Node.js, Automating Scraping with JavaScript, Chrome Puppeteer, Node.js

urllib

urllib.urlencode vs urllib.quote (in Python 3: urllib.parse.urlencode and urllib.parse.quote)
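
The difference between the two is easiest to see side by side. The sketch below uses the Python 3 names from urllib.parse; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus, urlencode

# urlencode builds a whole query string from key/value pairs.
query = urlencode({'q': 'web scraping', 'page': 2})
print(query)

# quote escapes a single string, typically a path segment; '/' is kept by default.
path = quote('/r/data science')
print(path)

# quote_plus escapes one value the way urlencode does by default: spaces become '+'.
value = quote_plus('a b&c')
print(value)
```

In short, urlencode is for building query strings from a mapping, while quote is for escaping one piece of a URL at a time.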

captcha

Anti-scraping measures