Wednesday, November 26, 2014

Useful Packages in R

| Package Name | Purpose | Studied? | Reference Sites |
|---|---|---|---|
| dplyr | Data manipulation | In progress | http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/ ; http://rpubs.com/justmarkham/dplyr-tutorial |
| ggplot2 | Visualization | In progress | http://docs.ggplot2.org/current/ ; http://www.victorgau.com/?p=3448 (in Chinese) ; http://blog.liang2.tw/2013-RConf-ggplot2-intro/?full#cover (in Chinese) ; http://r4stats.com/examples/graphics-ggplot2/ ; https://trinkerrstuff.wordpress.com/2013/07/05/ggplot2-chloropleth-of-supreme-court-decisions-an-tutorial/ (US map) |
| googleVis | Visualization | In progress | Examples ; PPT ; Package Vignette ; Examples ; http://www.magesblog.com/search/label/googleVis ; Detail Configuration |
| ggmap | Map | In progress | http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf ; http://xccds1977.blogspot.tw/2012/06/ggmap.html |
| xlsx | Read data from Excel | | |
| XML | Parse data | | |
| RCurl | Network client interface | | |
| rworldmap | World map with ISO codes | | |
| WDI | World Bank indicators | | |
| RgoogleMaps | Map | | http://cran.r-project.org/web/packages/RgoogleMaps/RgoogleMaps.pdf |
| rMaps | Map | | http://rmaps.github.io/blog/posts/animated-choropleths/index.html |
| shiny | Presentation | | |
| slidify | Presentation | | |
| FastRWeb | Presentation | | |
| quantmod | Data source | In progress | http://www.quantmod.com/ |
| Quandl | Data source | In progress | https://www.quandl.com/help/r ; https://s3.amazonaws.com/quandl-static-content/Documents/Quandl+-+R+Cheat+Sheet.pdf |
| wordcloud | Word cloud | In progress | https://georeferenced.wordpress.com/2013/01/15/rwordcloud/ |
| tm | Text mining | In progress | http://www.slideshare.net/whitish/textmining-with-r |
| qdap | Text mining | | http://cran.r-project.org/web/packages/qdap/index.html |
| Rwordseg | Text mining | | |
| gridExtra | Visualization layout | In progress | http://www.r-bloggers.com/extra-extra-get-your-gridextra/ |

http://www.rstudio.com/products/rpackages/

To study:

1. Word2Vec, http://blog.csdn.net/itplus/article/details/37969519

Sunday, November 23, 2014

Power of Three

Three Learning Principles

1. Occam’s Razor

2. Sampling Bias

3. Data Snooping

 

Occam’s Razor: trimming down unnecessary explanation

The simplest model that fits the data is also the most plausible.

 

Sampling Bias: If the data is sampled in a biased way, learning will produce a similarly biased outcome.

Training data and test data should both be drawn i.i.d. from the same distribution P.

Validation used randomly chosen records, while the test set consisted of the last (most recent) records; that mismatch is why the contest was still lost.

Consider using the same distribution (sampling scheme) in all of the training, validation and testing phases:

1. Give some training examples more weight if needed

2. Match validation with the test scenario as much as possible
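Point 2 can be sketched in a few lines of Python, assuming the test set is made of the most recent records. The helper name `chronological_split` and the toy records are hypothetical and only there for illustration; this is one possible way to do it, not a fixed recipe.

```python
def chronological_split(records, n_val):
    """Hold out the last n_val records as the validation set.

    Assumes `records` is already sorted from oldest to newest, so the
    validation set mimics a test set made of the most recent records.
    """
    if not 0 < n_val < len(records):
        raise ValueError("n_val must be between 1 and len(records) - 1")
    return records[:-n_val], records[-n_val:]


# Hypothetical toy data: ten records labelled by collection day.
records = [f"day-{d:02d}" for d in range(1, 11)]
train, val = chronological_split(records, n_val=3)
print("train ends at:", train[-1])
print("validation:", val)
```

A random split would instead mix old and new records in the validation set, which is the mismatch described above.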

 

Data Snooping

Red: the entire 8 years of data is used for training, so the performance looks good, but only because of snooping

Blue: 6 years are used for training and 2 years for testing; the result is even negative

Interrogate a prisoner long enough and anyone will confess!

1. Avoid deciding on the model after peeking at the data

2. Stay skeptical at all times

Friday, November 21, 2014

Validation in Machine Learning

We have to try different combinations to get a good g.

 

No, we cannot do it.

 

Choosing between 2 hypothesis sets (2 H) means 2 rounds of learning, which increases the model complexity, and that is not what we want.

 

The problem is: where can we find the test data in advance?

 

Etest is infeasible. Instead, we can reserve part of the data that would otherwise go into Ein (the validation set, giving Eval) and never use it in any calculation from the beginning, so that we can validate on it later.

 

D = Dtrain + Dval = training data + validation data

K = size of Dval, the data reserved for validation

N-K = size of Dtrain, the training data used to get the best gm-

 

1. Recall the learning curve: the more data used for learning, the smaller Eout we can get

2. Use Dtrain to compute every g-, then use Dval to check which one has the best Eval; that one becomes the best g- (for example, #37)

3. So why not run the winner (#37) again on all of the data D to get the final gm?
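The three steps can be sketched in plain Python. Everything here is an illustrative assumption rather than the course's own code: the two candidate models (constant and linear), the toy data, and the names `fit_constant`, `fit_linear` and `select_by_validation`.

```python
def fit_constant(points):
    """Constant model: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_linear(points):
    """Linear model y = slope * x + intercept, fitted by least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(g, points):
    return sum((g(x) - y) ** 2 for x, y in points) / len(points)


def select_by_validation(models, d_train, d_val):
    # Train every candidate on Dtrain only, then score it on Dval (Eval).
    e_val = {name: mean_squared_error(fit(d_train), d_val)
             for name, fit in models.items()}
    best = min(e_val, key=e_val.get)
    # Retrain the winner on all of D = Dtrain + Dval to get the final g.
    g_final = models[best](d_train + d_val)
    return best, g_final, e_val


# Hypothetical toy data, roughly y = 2x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1),
        (5, 11.0), (6, 12.9), (7, 15.2), (8, 17.1), (9, 18.9)]
d_val = data[::5]                               # K = N/5 = 2 points
d_train = [p for p in data if p not in d_val]   # N - K = 8 points

models = {"constant": fit_constant, "linear": fit_linear}
best, g_final, e_val = select_by_validation(models, d_train, d_val)
print("selected:", best)
print("Eval per model:", e_val)
```

The validation points are never seen while the candidates are trained; they only come back in for the final retraining of the winner.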

 

Why? Because the more data we put into Dval, the smaller Dtrain becomes, so Eout gets worse.

 

It is really hard to choose K, because the requirements say it must be very small and very large at the same time.

K = N/5 in practice, but some people use N/10, N/20, etc.

 

The answer is 3: 25 * (4N/5)^2 + N^2 = 16N^2 + N^2 = 17N^2.

One interesting point: learning with validation takes less time than normal learning without validation (which takes 25N^2).
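A quick arithmetic check of that comparison, assuming (as the formula suggests) 25 candidate models, a training cost of N^2 for N examples, and K = N/5. The concrete N = 1000 and the names `cost`, `M`, `K` are hypothetical.

```python
def cost(n):
    """Assumed cost of training one model on n examples: n squared."""
    return n ** 2


N = 1000       # hypothetical data size
M = 25         # number of candidate models
K = N // 5     # validation set size

# With validation: train all M models on N - K examples, then retrain
# only the winner on all N examples.
with_validation = M * cost(N - K) + cost(N)

# Without validation: train all M models on all N examples.
without_validation = M * cost(N)

print(with_validation / N ** 2, "vs", without_validation / N ** 2)
```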

 

Take K = 1 as the example: we leave only one data point out for validation, and en is the error on that specific validation point.

If we loop over every single data point (using each in turn as the validation data), then sum and average the errors, we get the Eloocv (leave-one-out cross validation) estimate.

 

There are 2 models here: the first is a linear model, the other is a constant model.

Comparing the two Eloocv values shows that the second one (constant) is better.

Why? For model 1, e1 is a squared distance, and it is very large.
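A minimal leave-one-out sketch comparing a linear and a constant model. The three data points are made up for illustration (they are not the ones from the slide), and the function names are hypothetical.

```python
def fit_constant(points):
    """Constant model: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_linear(points):
    """Linear model y = slope * x + intercept, fitted by least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return lambda x: slope * (x - mean_x) + mean_y


def loocv(fit, points):
    """Average squared error e_n over all leave-one-out rounds."""
    errors = []
    for n, (x_n, y_n) in enumerate(points):
        rest = points[:n] + points[n + 1:]   # the N - 1 training points
        g_minus = fit(rest)                  # model trained without point n
        errors.append((g_minus(x_n) - y_n) ** 2)
    return sum(errors) / len(errors)


# Hypothetical data: two points close in x but far apart in y, so a line
# through them extrapolates badly to the third point.
points = [(0.0, 0.0), (0.1, 1.0), (1.0, 0.5)]

print("Eloocv(linear):  ", loocv(fit_linear, points))
print("Eloocv(constant):", loocv(fit_constant, points))
```

On this toy data the constant model gets the lower Eloocv, because one of the left-out errors of the linear model is a very large squared distance.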

 

Take N = 1000 as the example. Eout(N – 1) means that the expected Eloocv on 1000 data points equals the expected Eout with 999 data points, which is Eout(g-), so it is almost the same as Eout(g).

 

1. The more features used, the better Ein gets, but the resulting curve is too complicated

2. Eout is often smallest with a moderate number of features; too many features lead to over-fitting

3. Leave-One-Out Cross Validation shows that using 5 or 7 features gives a lower Eloocv; on the other hand, the resulting curve is simpler than the one chosen by Ein

 

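The constant predictions are 6 when the 1st point is left out (mean of 5, 7), 4 when the 2nd is left out (mean of 1, 7), and 3 when the 3rd is left out (mean of 1, 5),

then e1 = (1 - 6)^2 = 25, e2 = (5 - 4)^2 = 1, e3 = (7 - 3)^2 = 16,

so Eloocv = (25 + 1 + 16)/3 = 42/3 = 14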
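The same calculation in Python, assuming the three target values are 1, 5 and 7, which is what the averages 6, 4, 3 and the squared errors 25, 1, 16 imply. The variable names are only for illustration.

```python
ys = [1, 5, 7]   # assumed target values of the three examples

errors = []
for n, y_n in enumerate(ys):
    rest = ys[:n] + ys[n + 1:]             # leave example n out
    prediction = sum(rest) / len(rest)     # constant model: mean of the rest
    errors.append((y_n - prediction) ** 2)

e_loocv = sum(errors) / len(errors)
print(errors, e_loocv)
```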

 

If the data size is 1000, then we have to train 1000 times, which is not feasible in practice

 

Instead of training 1000 times, how about cutting the data into only 10 parts (V = 10), so that we reduce the computation needed?
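A rough sketch of V-fold cross validation in plain Python, with a constant model and made-up data just to count how many trainings happen. The names `v_fold_indices`, `v_fold_cv` and `fit_mean` are hypothetical, and the folds here are contiguous for simplicity; in practice the data is usually shuffled first.

```python
def v_fold_indices(n, v):
    """Split indices 0..n-1 into v contiguous folds of near-equal size."""
    sizes = [n // v + (1 if i < n % v else 0) for i in range(v)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def v_fold_cv(points, v, fit, error):
    """Average validation error over v folds (v trainings in total)."""
    total = 0.0
    for fold in v_fold_indices(len(points), v):
        held_out = set(fold)
        d_val = [points[i] for i in fold]
        d_train = [p for i, p in enumerate(points) if i not in held_out]
        total += error(fit(d_train), d_val)
    return total / v


trainings = []   # one entry is appended for every training run


def fit_mean(d_train):
    trainings.append(len(d_train))
    mean_y = sum(y for _, y in d_train) / len(d_train)
    return lambda x: mean_y


def mean_squared_error(g, d_val):
    return sum((g(x) - y) ** 2 for x, y in d_val) / len(d_val)


# Hypothetical data set of size 1000 with a repeating target pattern.
points = [(x, float(x % 7)) for x in range(1000)]
e_cv = v_fold_cv(points, v=10, fit=fit_mean, error=mean_squared_error)
print("trainings:", len(trainings), "Ecv:", e_cv)
```

With V = 10 the model is trained 10 times on 900 points each, instead of 1000 times on 999 points as in leave-one-out.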

 

Wednesday, November 19, 2014

RPubs in R

http://rpubs.com/

Register a free account

 

Publish to RPubs

1. Click “Knit HTML” to preview the HTML result

2. Click the “Publish” button (don’t forget to log in to RPubs in advance)

3. Type the “Title” and “Description” of the document

4. A URL with a unique ID is generated

5. You can “Edit Details” or “Delete” the document

6. Click the “Share” button to share the link on Twitter, Facebook or Google+

7. Check “Recently Published”