Data Analysis
pd.concat defaults to join='outer', so no columns are dropped.
data-viz-with-python-apps-dashboards Exploratory Data Analysis in two lines of code using sweetviz 3-intermediate Excel functions and how-to-do-them-in-python visualization with maps
Number sense
Statistics basics: central tendency: mode / median / mean; variability: quartiles / interquartile range / outliers / variance / squared deviation / Bessel's correction; normalization: standard score (z-score); normal distribution; sampling distribution; estimation: confidence level / confidence interval; hypothesis testing: significance level; t-test
Data science research workflow Speed up Your Data Analysis in Python Pandas in the Premier League did-you-know-pandas-can-do-so-much validating-python-data-with-cerberus Image Recognition using Tensorflow and Probability Convolutional Neural Networks the biologically inspired model what is a container architecture design capsule networks the new deep learning network functional programming guide to happy
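A minimal sketch of the concat note above, using two made-up frames (`left` and `right` are illustrative names, not from any linked article). It contrasts the default `join="outer"` with `join="inner"`:

```python
import pandas as pd

# Two hypothetical frames that share only column "a".
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# Default join="outer": union of columns, missing cells become NaN.
outer = pd.concat([left, right], ignore_index=True)

# join="inner": only columns present in every frame survive.
inner = pd.concat([left, right], join="inner", ignore_index=True)

print(list(outer.columns))  # ['a', 'b', 'c']
print(list(inner.columns))  # ['a']
```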
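Several of the terms in the statistics list can be checked on a tiny made-up sample. The sketch below assumes NumPy and uses the common 1.5 × IQR rule for outliers, which is one convention among several:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up sample

mean = x.mean()
median = np.median(x)

# Quartiles and interquartile range.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Outliers by the 1.5 * IQR rule (a convention, not the only definition).
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# Variance: ddof=0 divides by n, ddof=1 applies Bessel's correction (n - 1).
pop_var = x.var(ddof=0)
sample_var = x.var(ddof=1)

# Standard scores (z-scores): zero mean, unit standard deviation.
z = (x - mean) / x.std(ddof=0)
```

Bessel's correction matters most for small samples: here the population variance is 4.0 while the corrected sample variance is 32/7.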
ELT
used-pandas-to-automate-cleaning-excel-files-with-python
Applied Data Science: Every Arrow on this diagram is a data science project
Agile Data Science (2019/01/15): Avro gmail.py Pig MongoDB ElasticSearch Bootstrap Flask D3 | AWS S3
ways to improve a Map Visualization Exploratory Statistical Data Analysis with a real dataset using Pandas
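From Scratch: common questions: 1) whether the chart is effective 2) computing Imputation Statistics 3) Transform Categorical Variables 4) Standardize Variables 5) Derive the Logarithm of the Target Variable 6) Suffer from Collinearity. How to get a handle on the data: creating it, measuring length, changing case, stripping extra whitespace, formatting output, extracting substrings, converting to datetime format, splitting on a pattern, testing whether a pattern is present and where, replacing by pattern, expressing patterns as regular expressions, and applying text-processing functions to arrays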
Create Data Science Project from Scratch satellite imagery analysis with python
Statistics Probability and Statistics for Data Science #1 Estimating Probabilities with Bayesian Modeling Prediction: School Performance vs Income Linear Regression Intro Regression
Retail Procurement America Land Use Hacker News Book Suggestion API
Analyzing data with Matplotlib Dash Gapminder Datalab + BigQuery Consecutive Numbers: more_itertools Visualization Tools Animated Chart by R Analyzing Trump Tweet Spotify Data Retrieve Stream Twitter Data to MySQL Twitter Word Cloud
lmdb (lightning memory-mapped database) lmdb-embeddings
pingouin ten SQL concepts you should know for data science interviews
import pandas as pd

df = pd.DataFrame({'DD': ['101/1/1', '101/2/1', '101/3/1']})
# Split the slash-separated date string into three new columns.
df[['Year', 'Month', 'Day']] = df['DD'].str.split('/', expand=True)
>>> print(df)
        DD Year Month Day
0  101/1/1  101     1   1
1  101/2/1  101     2   1
2  101/3/1  101     3   1
Visualization
streamlit dashboard plotting-in-pandas-just-got-prettier auto generated knowledge graphs The Grammar of Graphics forget matplotlib you should be using plotly
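R's data visualization tools: static plots use the built-in Base Plotting System and the ggplot2 package; interactive plots use the plotly package. The Python counterparts: static plots use matplotlib and seaborn; interactive plots use bokeh.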
Seaborn Tutorial: seaborn tries to make a well-defined set of hard things easy
Dash by Plotly large scale visualizations and mapping with datashader Data Visualization with Bokeh in Python part-one getting started Matplotlib Guide plotly.js grammar of graphics for effective visualization of multi dimensional
Botflow 3D Visualization to Tune Hyperparameters of ML Models Multi-Dimensional Data
customizing-plots-with-python-matplotlib
https://stackoverflow.com/questions/5854515/interactive-large-plot-with-20-million-sample-points-and-gigabytes-of-data legend
Seaborn is built on top of Matplotlib and improves the default styling, so plots look much better out of the box.
PyViz - SciPy 2018
Pandas
Data Frame to Postgresql; value_counts can be replaced with sidetable
Apparently pandas uses the C engine by default, and the C engine is more prone to encoding errors.
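Whether the C engine really is the cause is hearsay in the note above. What can be shown safely is that `pd.read_csv` lets you choose both the parser engine and the encoding explicitly. A small sketch, with made-up data encoded as cp1252:

```python
import io
import pandas as pd

# Hypothetical CSV bytes in a non-UTF-8 encoding.
raw = "name,city\nRenée,Zürich\n".encode("cp1252")

# State the encoding explicitly instead of relying on the UTF-8 default,
# and opt into the slower pure-Python parser via engine="python".
df = pd.read_csv(io.BytesIO(raw), encoding="cp1252", engine="python")

print(df.loc[0, "city"])
```

If a file fails to load, naming the correct `encoding` is usually the first thing to try; switching `engine` is a secondary option.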
Tidying up Pandas why-and-how-to-use-pandas-with-large-data Data Clean: Missing Values Code Example Udemy: Data Analysis Pandas Cheat Sheet Quick Dive Complete Tutorial to Learn Data Science from Scratch Nick Eubank Useful Snippets Pandas Snippets 範例教學 Scrape Weather Data with Pandas Pandas Dataframe as a Process Tracker (postgres example) Udemy
Preserving leading zeros in ID numbers. If an ID column uses an int dtype it cannot hold missing values; the workaround is to use a float dtype.
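A short sketch of both points, on a made-up CSV (`id` and `qty` are hypothetical column names). Reading the ID column as a string keeps the zeros. The nullable `Int64` dtype shown at the end is an alternative to the float workaround, available in newer pandas versions:

```python
import io
import pandas as pd

csv_text = "id,qty\n007,1\n042,\n"  # second row has a missing qty

# Default parsing: "007" becomes the integer 7, and qty becomes float64
# because a plain int column cannot hold the missing value.
default = pd.read_csv(io.StringIO(csv_text))

# Reading id as a string preserves the leading zeros.
kept = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# Alternative to float: pandas' nullable integer dtype (capital "I").
nullable_qty = kept["qty"].astype("Int64")
```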
Pandas Basics Broadcasting Reshaping Pivot Table
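Specifying data types for Excel columns. Split a Column into Two. pd.pivot_table or groupby(['學生', '科目'])['成績'].mean().unstack() (columns: student, subject, score). Personally I prefer groupby with an aggregation function; pivot feels less intuitive to me, probably because I never had much Excel training.
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)  # six consecutive days as the index
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))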
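To make the pivot_table versus groupby comparison concrete, here is a sketch on a made-up score table. The English column names are stand-ins for the student / subject / score columns in the note:

```python
import pandas as pd

# Hypothetical data: one row per (student, subject) exam result.
scores = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob", "Bob"],
    "subject": ["math", "art", "math", "art", "math"],
    "score": [90, 80, 70, 60, 50],
})

# groupby + aggregation, then move "subject" from the index to columns.
via_groupby = scores.groupby(["student", "subject"])["score"].mean().unstack()

# The same table expressed as a pivot.
via_pivot = pd.pivot_table(
    scores, index="student", columns="subject", values="score", aggfunc="mean"
)

print(via_groupby)
```

Both produce a students-by-subjects table of mean scores, so the choice is mostly a matter of which reads more naturally.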
Balance Tasks between Pandas and PostgreSQL
NBA Practical Medium Data Analytics with Python (10 Things I Hate About Pandas) Geospatial Analysis 37:30 Filtering on Geodesic Features Exploring and Machine Learning for Airbnb Listings in Toronto
GeoPandas + Leaflet: working with existing tools GeoPandas GIF ArcGIS QGIS PostGIS D3 geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map PowerBI + ArcGIS
5-mistakes-i-made-when-doing-custom-data-visualization-with-d3-js
Geographic Statistical Data with Google Maps
Modin: accelerates Pandas queries by 4x on an 8-core machine, only requiring users to change a single line of code in their notebooks.
Numpy
numpy-guide-for-people-in-a-hurry Beautiful Code with NumPy Introduction to Data Analytics with Pandas
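Normally distributed random numbers:
s = np.random.normal(100, 5, 100)  # mean, standard deviation, number of samples
b = s.astype(int)  # convert to integers (truncates the fractional part)
b = b.clip(0, 100)  # values above 100 become 100; values below 0 become 0
n = 5  # example length (illustrative); must be at most 26 for the letter index below
series = pd.Series(np.random.rand(n))  # uniform floats in [0, 1)
series = pd.Series(np.random.randint(1, 5, n))  # integers from 1 to 4
series = pd.Series(np.random.randint(1, 5, n), dtype=np.float64)
series = pd.Series(np.random.randint(1, 5, n), dtype=np.float64, index=[n*x for x in range(n)])
series = pd.Series(np.random.randint(1, 100, n), dtype=np.float64, index=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')[:n])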
Circular binary structure for SciPy morphological operations
import numpy as np

def circular_structure(radius):
    """Return a boolean square array of side 2*radius+1 that is True inside a disk."""
    size = radius * 2 + 1
    i, j = np.mgrid[0:size, 0:size]
    # Shift the indices so the centre cell is (0, 0); size // 2 equals radius.
    i -= size // 2
    j -= size // 2
    return np.sqrt(i**2 + j**2) <= radius

In [1]: import numpy as np
In [2]: a = np.zeros((1, 2, 3))
In [3]: a.shape
Out[3]: (1, 2, 3)
In [4]: a_sum = a.sum(axis=-1)
In [5]: a_sum.shape
Out[5]: (1, 2)
# the last axis has been summed away
In [6]: a_sum0 = a.sum(axis=0)
In [7]: a_sum0.shape
Out[7]: (2, 3)
Handling missing values; MultiIndex: nansum(), nanmin(), nanmax(), np.nan
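A quick sketch of why the nan-aware functions listed above exist: the ordinary reductions propagate `np.nan`, while the `nan*` variants skip it. The array is made up for illustration:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])

plain_sum = np.sum(a)     # NaN propagates through the ordinary sum
nan_sum = np.nansum(a)    # NaN is treated as missing and skipped
nan_min = np.nanmin(a)
nan_max = np.nanmax(a)

print(plain_sum, nan_sum, nan_min, nan_max)
```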
https://ithelp.ithome.com.tw/articles/10200433 concat merge append join http://violin-tao.blogspot.com/2017/06/pandas-2-concat-merge.html
SciPy
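A bare import scipy does not give you the io submodule; you need the from .... import form to get it. The mechanism behind this is interesting. (This was the behavior when the note was written; newer SciPy releases may load submodules lazily on access.)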
Numba
Why Use Dask Pyodide: Scientific Python in the Browser
PyTubes
Analysing 1.4 billion rows with Python
PySpark
brief-introduction-to-pyspark 介紹 Multi-Class Text Classification
NLP
TextBlob uses NLTK as backend NLP toolkit
Step-by-Step Guide: 1) Gather Data 2) Clean Data 3) Find a Good Data Representation 4) Classification 5) Inspection 6) Accounting for Vocabulary Structure 7) Leverage Semantics 8) End-to-End Approaches
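The Best Bridge introductory guide: the data transformation steps include: 1) tokenizing the text 2) building a dictionary and converting text to integer sequences 3) zero padding the sequences 4) one-hot encoding the labels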
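The four transformation steps above can be sketched without any deep learning framework. This is an illustration under stated assumptions, not the guide's own code: tokenization is a plain whitespace split (Chinese text would need a real word segmenter), index 0 is reserved for padding, and sequences are padded on the left:

```python
import numpy as np

texts = ["the cat sat", "the dog sat on the mat"]  # made-up corpus
labels = [0, 1]                                    # made-up class ids

# 1) Tokenize (whitespace split is an assumption for this English toy data).
tokens = [t.split() for t in texts]

# 2) Build a dictionary and map each token to an integer; 0 is kept for padding.
vocab = {}
for sent in tokens:
    for word in sent:
        vocab.setdefault(word, len(vocab) + 1)
sequences = [[vocab[w] for w in sent] for sent in tokens]

# 3) Zero-pad every sequence to the same length (left padding chosen here).
max_len = max(len(s) for s in sequences)
padded = np.array([[0] * (max_len - len(s)) + s for s in sequences])

# 4) One-hot encode the labels by indexing into an identity matrix.
one_hot = np.eye(len(set(labels)), dtype=int)[labels]

print(padded)
print(one_hot)
```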
NLP Fun 100-times-faster-natural-language-processing-in-python
allenai/allennlp Neo4j py2neo build-your-own-knowledge-graph going-dutch part2 improving machine learning model using geographical data
Deep Transfer Learning for Natural Language Processing: Text Classification
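Summing word vectors has little physical meaning: a document is just a bag of words, and the task is to see how similar two bags of words are. gensim TFIDF, BM25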
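The bag-of-words comparison in the note above can be sketched in plain Python. This uses a simplified TF-IDF weighting (raw term count times log(N / document frequency)) and cosine similarity; it is not gensim's implementation, and the helper names `tfidf` and `cosine` are hypothetical:

```python
import math
from collections import Counter

# Made-up, pre-tokenized documents.
docs = [
    "pandas makes data analysis easy".split(),
    "numpy makes array analysis fast".split(),
    "cats sleep all day".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Sparse TF-IDF vector as a dict: count * log(N / document frequency)."""
    counts = Counter(doc)
    return {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = [tfidf(d) for d in docs]
sim_01 = cosine(vectors[0], vectors[1])  # share "makes" and "analysis"
sim_02 = cosine(vectors[0], vectors[2])  # no words in common
```

BM25 follows the same bag-of-words idea but adds term-frequency saturation and document-length normalization.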
named-entity-recognition-with-nltk-and-spacy
Data Science to help Women make Contraceptive Choices
Twitter Sentiment Analysis: 1) HTML Decoding 2) '@'mention 3) URL Links 4) UTF8 BOM (Byte Order Mark) 5) HashTag
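A minimal sketch of the five cleaning steps above, using only the standard library. The function name `clean_tweet` is hypothetical, and two choices are assumptions rather than something the note specifies: mentions are dropped entirely, and hashtags keep their word but lose the `#`:

```python
import html
import re

def clean_tweet(text):
    """Apply the five cleaning steps to one raw tweet string."""
    text = html.unescape(text)                         # 1) HTML decoding: &amp; -> &
    text = re.sub(r"@\w+", "", text)                   # 2) drop @mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # 3) drop URL links
    text = text.replace("\ufeff", "")                  # 4) strip the UTF-8 BOM character
    text = re.sub(r"#(\w+)", r"\1", text)              # 5) keep the hashtag word, drop '#'
    return " ".join(text.split())                      # collapse leftover whitespace

raw = "\ufeff@bob I &lt;3 #pandas &amp; numpy https://t.co/abc"
print(clean_tweet(raw))  # I <3 pandas & numpy
```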
BERT: Transformer
Time Series Analysis
End-to-End Project on Time Series Analysis and Forecasting
Food Delivery Prediction (delivery time estimation): Google Maps API
Network Analysis
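NetworkX vs iGraph
Audio streams come in different formats. A typical WAV file is signed int16 PCM at 44100 Hz, but with NumPy there is a good chance you end up with float32 (or float64), and the contents are of course not the same (librosa defaults to float32). WAV files also carry a 44-byte header (strip it if you don't need it), and there is the mono versus multi-channel difference. If the audio is long, it is best to send it as a stream, and the algorithm should ideally be a streaming one too, so that latency stays low (if you are building a low-latency application). For Flask, search for "flask socket io audio".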
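The int16 versus float32 point in the audio note can be sketched with the standard `wave` module and NumPy. The tone below is synthetic and stands in for a real file. Dividing by 32768 is one common scaling convention for mapping int16 to [-1, 1), assumed here rather than taken from the note:

```python
import io
import wave
import numpy as np

# Build a tiny mono 16-bit PCM WAV in memory (stand-in for a real recording).
sample_rate = 44100
t = np.arange(sample_rate // 100) / sample_rate           # 10 ms of audio
pcm = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype("<i2")

buf = io.BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(1)        # mono
    writer.setsampwidth(2)        # 2 bytes per sample = int16
    writer.setframerate(sample_rate)
    writer.writeframes(pcm.tobytes())

# Everything before the sample data is header (44 bytes for this simple PCM file).
header_size = len(buf.getvalue()) - pcm.nbytes

# Read it back: the wave module skips the header and returns raw frames.
buf.seek(0)
with wave.open(buf, "rb") as reader:
    channels = reader.getnchannels()
    frames = reader.readframes(reader.getnframes())

samples_i16 = np.frombuffer(frames, dtype="<i2")          # what is in the file
samples_f32 = samples_i16.astype(np.float32) / 32768.0    # what many DSP libraries expect
```

For multi-channel files the frames are interleaved, so `samples_i16.reshape(-1, channels)` would separate the channels.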