Data Analysis

Pandas, Numpy, Scientific Stack

https://medium.com/@trrhodes/validating-python-data-with-cerberus-374447bd3cbe  https://medium.com/@zachary.bedell/image-recognition-using-tensorflow-and-probability-52f0e35de198 https://medium.com/@james_aka_yale/convolutional-neural-networks-the-biologically-inspired-model-9b7e948f6987 https://medium.com/@farhadmalik84/what-is-a-container-architecture-design-54826e93fc18 https://medium.com/@aryanmisra/capsule-networks-the-new-deep-learning-network-bd917e6818e8 https://medium.com/@anirudheka/a-functional-programming-guide-to-happy-4988dc6c1764

Agile Data Science (2019/01/15): Avro gmail.py Pig MongoDB ElasticSearch Bootstrap Flask D3 | AWS S3

ways-to-improve-a-map-visualization exploratory-statistical-data-analysis-with-a-real-dataset-using-pandas

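From Scratch: Common questions: 1) Whether the plots are valid 2) Compute Imputation Statistics 3) Transform Categorical Variables 4) Standardize Variables 5) Derive the Logarithm of the Target Variable 6) Suffer from Collinearity How to get a handle on text data: create, measure length, change case, strip extra whitespace, format output, extract substrings, convert to datetime format, split on a pattern, test whether a pattern exists and where, replace by pattern, express patterns as regular expressions, and apply text-processing functions to arrays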
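A minimal sketch of preprocessing steps 2 to 6 from the list above, using pandas and NumPy. The toy data and the column names (`area`, `city`, `price`) are hypothetical, and the median is only one possible imputation statistic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "area": [50.0, np.nan, 80.0, 120.0],
    "city": ["A", "B", "A", "C"],
    "price": [100.0, 150.0, 200.0, 400.0],
})

# 2) Imputation statistic: fill missing values with the column median.
area_median = df["area"].median()
df["area"] = df["area"].fillna(area_median)

# 3) and 6) One-hot encode the categorical variable; dropping the first
# level avoids perfectly collinear indicator columns.
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# 4) Standardize the numeric feature to zero mean and unit variance.
df["area_std"] = (df["area"] - df["area"].mean()) / df["area"].std(ddof=0)

# 5) Work with the logarithm of the target instead of the raw target.
df["log_price"] = np.log(df["price"])
```

In a real project the imputation and scaling statistics would normally be computed on the training split only and then reused on the test split.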

Create Data Science Project from Scratch satellite-imagery-analysis-with-python

Statistics Probability and Statistics for Data Science #1 Estimating Probabilities with Bayesian Modeling Prediction: School Performance vs Income Linear Regression Intro Regression

Retail Procurement America Land Use Hacker News Book Suggestion API

Matplotlib Data Analysis Dash Gapminder Datalab + BigQuery Consecutive Numbers: more_itertools Visualization Tools Animated Chart by R Analyzing Trump Tweet Spotify Data Retrieve Stream Twitter Data to MySQL

lmdb (lightning memory-mapped database) lmdb-embeddings

Visualization

large-scale-visualizations-and-mapping-with-datashader data-visualization-with-bokeh-in-python-part-one-getting-started Matplotlib Guide plotly.js grammar of graphics for effective visualization of multi dimensional

Botflow 3D Visualization to Tune Hyperparameters of ML Models Multi-Dimensional Data

Principal Component Analysis

customizing-plots-with-python-matplotlib

Force Directed Graph

Seaborn builds on Matplotlib and improves the default styling, so plots look much better.

PyViz - SciPy 2018

Pandas

tidying-up-pandas why-and-how-to-use-pandas-with-large-data Data Clean: Missing Values Code Example Udemy: Data Analysis Pandas Cheat Sheet Quick Dive Complete Tutorial to Learn Data Science from Scratch Nick Eubank Useful Snippets Pandas Snippets Example Tutorials Scrape Weather Data with Pandas Pandas Dataframe as a Process Tracker (postgres example) Udemy

Keeping leading zeros in ID numbers. If an ID column uses the int dtype it cannot hold missing values; the workaround is to use a float dtype.
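A short sketch of both points, assuming the IDs arrive through `pd.read_csv`. The CSV text is made up for illustration. The nullable `Int64` dtype in the last line is an alternative offered by newer pandas versions, not something the note above mentions.

```python
import io

import pandas as pd

csv_text = "id,qty\n007,1\n042,\n"  # hypothetical data with one missing qty

# Default parsing turns "007" into the integer 7, and the gap in qty
# forces that column to float64.
default = pd.read_csv(io.StringIO(csv_text))

# Reading the ID column as str keeps the leading zeros.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# Alternative to the float workaround: pandas' nullable integer dtype.
nullable = default["qty"].astype("Int64")
```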

Pandas Basics Broadcasting Reshaping Pivot Table

Specify the Data Type of Excel Columns Split a Column into Two

10 Minutes to Pandas

import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

String Processing Methods
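A minimal sketch of the pandas `.str` accessor covering common string operations: length, case, whitespace stripping, splitting, pattern tests, replacement, and datetime conversion. The sample strings are hypothetical.

```python
import pandas as pd

s = pd.Series(["  Alice Smith ", "bob jones", "2019-01-15 note"])

lengths = s.str.len()                        # measure length
cleaned = s.str.strip().str.lower()          # strip whitespace, change case
first_word = cleaned.str.split(" ").str[0]   # split on a delimiter
has_digit = cleaned.str.contains(r"\d")      # test whether a pattern exists
position = cleaned.str.find("o")             # position of a substring, -1 if absent
replaced = cleaned.str.replace(r"\s+", "_", regex=True)  # regex replacement

# Extract an ISO date with a regex, then convert to datetime (NaT if absent).
dates = pd.to_datetime(cleaned.str.extract(r"(\d{4}-\d{2}-\d{2})")[0])
```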

Balance Tasks between Pandas and PostgreSQL

NBA Practical Medium Data Analytics with Python (10 Things I Hate About Pandas) Geospatial Analysis 37:30 Filtering on Geodesic Features Exploring and Machine Learning for Airbnb Listings in Toronto

GeoPandas + Leaflet: Working with Existing Tools GeoPandas GIF ArcGIS QGIS PostGIS D3 geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map

5-mistakes-i-made-when-doing-custom-data-visualization-with-d3-js

Bokeh vs Dash Data Cleaning

Geographic Statistical Data with Google Maps

SQL for Mode Analytics

Numpy

numpy-guide-for-people-in-a-hurry Beautiful Code with NumPy Introduction to Data Analytics with Pandas

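Normally distributed random numbers: s = np.random.normal(100, 5, 100) (arguments: mean, standard deviation, number of samples)

b = s.astype(int) converts to integers

b = b.clip(0, 100) values above 100 become 100, values below 0 become 0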
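The three snippets above can be run together as below. The fixed seed and the `rounded` variant are additions for illustration. Note that `astype(int)` truncates toward zero rather than rounding.

```python
import numpy as np

np.random.seed(0)                    # fixed seed, added here for repeatability
s = np.random.normal(100, 5, 100)    # mean 100, standard deviation 5, 100 samples
b = s.astype(int)                    # truncates toward zero: 99.9 becomes 99
b = b.clip(0, 100)                   # cap values to the range 0..100

# Alternative if rounding to the nearest integer is wanted instead.
rounded = np.rint(s).astype(int).clip(0, 100)
```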

import numpy as np
import pandas as pd

n = 5  # example length; the last line needs n <= 26
series = pd.Series(np.random.rand(n))
series = pd.Series(np.random.randint(1, 5, n))
series = pd.Series(np.random.randint(1, 5, n), dtype=np.float64)
series = pd.Series(np.random.randint(1, 5, n), dtype=np.float64, index=[n*x for x in range(n)])
series = pd.Series(np.random.randint(1, 100, n), dtype=np.float64, index=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')[:n])

Circular binary structure for SciPy morphological operations

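import numpy as np

def circular_structure(radius):
    """Return a boolean (2*radius+1) square array that is True inside a disc."""
    size = radius*2+1
    i,j = np.mgrid[0:size, 0:size]
    # Shift the integer grids so the centre cell has coordinates (0, 0).
    i -= size // 2
    j -= size // 2
    return np.sqrt(i**2+j**2) <= radius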
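A hedged usage sketch: the structure can be passed as the `structure` argument of `scipy.ndimage.binary_dilation`. The function is repeated so the block runs on its own, with integer division for the centre offset. The 7x7 test image is hypothetical.

```python
import numpy as np
from scipy import ndimage

def circular_structure(radius):
    size = radius * 2 + 1
    i, j = np.mgrid[0:size, 0:size]
    i -= size // 2   # integer offset keeps the grids integer-typed
    j -= size // 2
    return np.sqrt(i**2 + j**2) <= radius

# Dilating a single pixel reproduces the structuring element around it.
image = np.zeros((7, 7), dtype=bool)
image[3, 3] = True
dilated = ndimage.binary_dilation(image, structure=circular_structure(2))
```

With radius 2 the corners of the 5x5 neighbourhood stay False, which is the difference from a square structuring element.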

eliminate double loop

In [1]: import numpy as np
In [2]: a = np.zeros((1, 2, 3))
In [3]: a.shape
Out[3]: (1, 2, 3)
In [4]: a_sum = a.sum(axis=-1)
In [5]: a_sum.shape
Out[5]: (1, 2)

The last axis has been summed away (collapsed).

In [6]: a_sum0 = a.sum(axis=0)
In [7]: a_sum0.shape
Out[7]: (2, 3)
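To connect the axis sums above with eliminating a double loop, here is a small sketch comparing an explicit Python loop with the equivalent single `sum(axis=-1)` call. The array contents are arbitrary.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)

# Explicit double loop: sum over the last axis for every (i, j) pair.
looped = np.empty((2, 3), dtype=a.dtype)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        looped[i, j] = a[i, j].sum()

# Vectorized equivalent: one call, no Python-level loops.
vectorized = a.sum(axis=-1)
```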

mars: parallel computing

SciPy

A bare import scipy does not give you the io submodule; you need the from .... import form to get it. The mechanism behind this is interesting.
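A small sketch of the explicit submodule import, followed by a round trip through an in-memory .mat file to show the submodule is usable. The alias `sio` and the variable name `x` are illustrative. Whether plain `import scipy` also exposes `scipy.io` may depend on the SciPy version, so the explicit form is the safe one.

```python
from io import BytesIO

import numpy as np
from scipy import io as sio  # explicit submodule import

# Write a small array to an in-memory .mat file and read it back.
buffer = BytesIO()
sio.savemat(buffer, {"x": np.arange(3)})
buffer.seek(0)
loaded = sio.loadmat(buffer)
```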

Numba

Better Performance

Why Use Dask  Pyodide: Scientific Python in the Browser

PyTubes

Analysing 1.4 billion rows with Python

PySpark

brief-introduction-to-pyspark Introduction Multi-Class Text Classification

NLP

Step-by-Step Guide: 1) Gather Data 2) Clean Data 3) Find a Good Data Representation 4) Classification 5) Inspection 6) Accounting for Vocabulary Structure 7) Leverage Semantics 8) End-to-End Approaches

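The Best Bridge (introductory guide): the data transformation steps include: 1) Tokenize the text 2) Build a dictionary and convert the text into numeric sequences 3) Zero Padding of the sequences 4) One-hot Encoding of the labels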
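The four transformation steps above can be sketched with plain Python and NumPy. The corpus and labels are hypothetical. A whitespace split stands in for a real tokenizer (Chinese text would need a word segmenter), and padding on the left is just one common convention.

```python
import numpy as np

texts = ["the cat sat", "the dog sat down today"]  # hypothetical corpus
labels = [0, 1]                                     # hypothetical class ids

# 1) Tokenize (whitespace split as a stand-in for a real tokenizer).
tokens = [t.split() for t in texts]

# 2) Build a dictionary; index 0 is reserved for padding.
vocab = {}
for sentence in tokens:
    for word in sentence:
        vocab.setdefault(word, len(vocab) + 1)
sequences = [[vocab[w] for w in sentence] for sentence in tokens]

# 3) Zero-pad every sequence to the same length (padding on the left).
max_len = max(len(seq) for seq in sequences)
padded = np.zeros((len(sequences), max_len), dtype=int)
for row, seq in enumerate(sequences):
    padded[row, max_len - len(seq):] = seq

# 4) One-hot encode the labels by indexing rows of an identity matrix.
one_hot = np.eye(max(labels) + 1, dtype=int)[labels]
```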

NLP Fun 100-times-faster-natural-language-processing-in-python

allenai/allennlp Neo4j build-your-own-knowledge-graph going-dutch part2 improving machine learning model using geographical data

Deep Transfer Learning for Natural Language Processing: Text Classification

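Summing word vectors has little physical meaning: a document is just a group of words, and the task is to measure how similar two groups of words are. gensim TFIDF, BM25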
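A hand-rolled sketch of the "compare two groups of words" idea, using a simple smoothed TF-IDF weighting and cosine similarity. This is not gensim's API and not BM25; the documents and the exact weighting formula are assumptions chosen for illustration.

```python
import math
from collections import Counter

docs = [
    "pandas makes data analysis easy".split(),
    "numpy makes array math fast".split(),
    "pandas makes data cleaning easy".split(),
]

# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

def tfidf(doc):
    """Weight each word by term frequency times a smoothed inverse document frequency."""
    counts = Counter(doc)
    return {
        w: (c / len(doc)) * math.log((1 + n_docs) / (1 + doc_freq[w]))
        for w, c in counts.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = [tfidf(doc) for doc in docs]
sim_01 = cosine(vectors[0], vectors[1])  # share only "makes", which has zero weight
sim_02 = cosine(vectors[0], vectors[2])  # share three informative words
```

A word that appears in every document gets zero weight under this formula, so the first pair scores as unrelated even though both contain "makes".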

named-entity-recognition-with-nltk-and-spacy

using-data-science-to-help-women-make-contraceptive-choices

Twitter Sentiment Analysis: 1) HTML Decoding 2) '@'mention 3) URL Links 4) UTF8 BOM (Byte Order Mark) 5) HashTag
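A minimal sketch of the five cleaning steps above using only the standard library. The function name `clean_tweet` and the regular expressions are illustrative. Keeping the hashtag text while dropping the `#` is one possible choice; removing the whole tag is another.

```python
import html
import re

def clean_tweet(text):
    text = html.unescape(text)                # 1) HTML decoding (&amp; becomes &)
    text = text.replace("\ufeff", "")         # 4) drop the UTF-8 byte order mark
    text = re.sub(r"@\w+", "", text)          # 2) remove @mentions
    text = re.sub(r"https?://\S+", "", text)  # 3) remove URL links
    text = re.sub(r"#(\w+)", r"\1", text)     # 5) keep hashtag text, drop '#'
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

cleaned = clean_tweet("\ufeff@bob I &amp; my dog love #pandas https://t.co/abc")
```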

Time Series Analysis

End-to-End Project on Time Series Analysis and Forecasting

Food Delivery Prediction (delivery time estimation): Google Map API