Welcome to data science<br />
<br />
Comparison of packages RPostgreSQL and RPostgres, vol 1 - connecting (2015-03-18)<br />
<br />
Analysing data often means working with data stored in a database. In my case it's stored in a PostgreSQL database. My data flow is: pull data into R -> analyse it, compute some new features -> return data into the database for other people and tools. I could do it via files, but pulling data directly from Postgres into R turns out to be really fast.<br />
<br />
The first thing, though, is connecting to said database. There's a choice between two packages, <a href="https://code.google.com/p/rpostgresql/" target="_blank">RPostgreSQL</a> and <a href="https://github.com/rstats-db/RPostgres" target="_blank">RPostgres</a>. The former is by Dirk Eddelbuettel and the latter is by Hadley Wickham - both very well known names in the R world, and both were added as members of the <a href="http://www.r-project.org/foundation/members.html" target="_blank">R Foundation</a> in 2014.<br />
<br />
Setting up the connection with RPostgreSQL:<br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">library(RPostgreSQL)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">dbhost <- '</span><span style="font-family: 'Courier New', Courier, monospace;">127.0.0.1</span><span style="font-family: Courier New, Courier, monospace;">'</span><br />
<span style="font-family: Courier New, Courier, monospace;">dbport <- '5432'</span><br />
<span style="font-family: Courier New, Courier, monospace;">dbuser <- 'lauri_koobas'</span><br />
<span style="font-family: Courier New, Courier, monospace;">dbname <- 'my_db'</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">## loads the PostgreSQL driver</span><br />
<span style="font-family: Courier New, Courier, monospace;">drv <- dbDriver("PostgreSQL")</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">## Open a connection</span><br />
<span style="font-family: Courier New, Courier, monospace;">con <- dbConnect(drv, host=dbhost, port=dbport, dbname=dbname, user=dbuser, </span><span style="font-family: 'Courier New', Courier, monospace;">password=scan(what=character(),nmax=1,quiet=TRUE))</span><br />
<div>
<br /></div>
<div>
The password part prompts you for it in the console window. I usually follow it up by hitting CTRL-L to clear the console as well. No need to save or show your password anywhere for too long.</div>
<br />
Setting up the connection with RPostgres is pretty much the same, except you pass <span style="font-family: Courier New, Courier, monospace;">RPostgres::Postgres()</span> instead of <span style="font-family: Courier New, Courier, monospace;">drv</span> as the first argument to dbConnect.<br />
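Spelled out, the RPostgres version might look like this - a sketch reusing the variables defined above, with DBI loaded explicitly for dbConnect():

```r
library(DBI)        # provides the dbConnect() generic
library(RPostgres)  # provides the Postgres() driver

## Same connection details as above; the password is again typed at the console
con <- dbConnect(RPostgres::Postgres(),
                 host = dbhost, port = dbport, dbname = dbname, user = dbuser,
                 password = scan(what = character(), nmax = 1, quiet = TRUE))
```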
<br />
But what if you need to run your script automatically? Or have it deployed to a production server where you have no way to go and type in your password? Writing passwords directly into code is a very bad idea, as the code will end up in a repository and you really don't want your password to end up on GitHub.<br />
<br />
Well, there's more than one way to approach this problem as well. You could write the password to a configuration file that is not stored in the repository. Or you could put the whole connection string in a configuration file, so your script only loads it and connects.<br />
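A minimal sketch of the configuration-file approach (the file name db_config.R and the variable names are just my own picks for illustration): keep the credentials in a small R file that is listed in .gitignore, and source() it before connecting.

```r
## db_config.R -- lives outside the repository (or is listed in .gitignore):
##   dbhost <- '127.0.0.1'
##   dbport <- '5432'
##   dbname <- 'my_db'
##   dbuser <- 'lauri_koobas'
##   dbpass <- 'not-in-the-repo'

## In the analysis script:
source("db_config.R")  # defines dbhost, dbport, dbname, dbuser, dbpass
con <- dbConnect(drv, host = dbhost, port = dbport, dbname = dbname,
                 user = dbuser, password = dbpass)
```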
<br />
But PostgreSQL offers another solution - the <a href="http://www.postgresql.org/docs/9.4/static/libpq-pgpass.html">.pgpass</a> file. It works a bit differently on Windows and Unix, but essentially it's a type of configuration file meant for this exact problem. It has the following format, and you can use wildcards in it:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">hostname:port:database:username:password</span><br />
<br />
Using it is a bit tricky, though, and took me a while to figure out. What you need to do is provide everything BUT the password - the driver will then find the password in the .pgpass file. Here's an example using the RPostgres package.<br />
<br />
.pgpass file:<br />
<span style="font-family: Courier New, Courier, monospace;">*:*:my_db:lauri_koobas:password</span><br />
<br />
R script:<br />
<span style="font-family: Courier New, Courier, monospace;">con <- dbConnect(RPostgres::Postgres(), host="127.0.0.1", port="5432", dbname="my_db", user="lauri_koobas")</span><br />
<br />
That's it. Keep your code version-controlled and your passwords safe!<br />
<br />
Statistical Learning MOOC by Stanford (2015-03-17)<br />
<br />
I took the <a href="https://class.stanford.edu/courses/HumanitiesandScience/StatLearning/Winter2015/about" target="_blank">course</a> and this is a short summary of my impressions of it.<br />
<br />
Pros:<br />
<br />
<ul>
<li>there was some humor (or attempts at it) by the presenters</li>
<li>the book this course is based on is very good, at least the few pages of it that I managed to read</li>
<li>the interviews with the original inventors of some of the methods were cool!</li>
</ul>
<div>
Cons:</div>
<div>
<ul>
<li>the quizzes were rather cryptic and of very variable difficulty, especially the ones concerning R</li>
<li>if you answered incorrectly then "show answer" was sometimes very helpful and other times not at all</li>
</ul>
If you've done mostly Coursera courses, the style of the quizzes here is very different - this course probes your intuition about exceptions to the rule. It will show you the limitations of what you just learned and is best treated as an additional learning resource. The forums were fairly dead compared to your regular Coursera course, but still helpful - all you really need is one good answer.</div>
<div>
<br /></div>
<div>
In conclusion, the course helped me see and understand some new aspects of modeling. As that was my main goal anyway, it's all good.</div>
How to manage a large amount of data with (Postgre)SQL in R (2015-02-23)<br />
<br />
In my R projects I've used data stored in a MySQL database before, and it's straightforward: you send a query, you receive the result, and that's it. It works well if you have a small dataset and can manage to write the queries by hand, or if you insert/update only tens or hundreds of rows. Just looping through the data and running the queries one by one is the fast (to write) and easy solution.<br />
<br />
Recently I've had a project where the backend is PostgreSQL and the amount of data to pull and push is a few hundred thousand rows at a time. Querying is still fine, but doing updates in a loop is definitely not. Even at 1 second per insert it would take days to push the results back into the database. I could dump the results into a file and COPY that into the database, and that's probably the only solution once the number of rows gets into the millions, but there is another way for intermediate data sizes (1k through 1 mln rows).<br />
<br />
The package used is <a href="http://cran.r-project.org/web/packages/RPostgreSQL/RPostgreSQL.pdf" target="_blank">RPostgreSQL</a> (with <a href="https://code.google.com/p/rpostgresql/" target="_blank">examples at Google Code</a>). The feature that makes inserting a large number of rows at once fast is the only briefly mentioned multi-row insert format - see the bottom of the <a href="http://www.postgresql.org/docs/9.0/static/dml-insert.html" target="_blank">manual</a>. The idea is that you just list the values one after the other:<br />
<br />
<code>
INSERT INTO products (product_no, name, price) VALUES<br />
(1, 'Cheese', 9.99),<br />
(2, 'Bread', 1.99),<br />
(3, 'Milk', 2.99);<br />
</code>
<br />
Sounds simple enough. But I hope you want to be safe in your SQL queries, so you escape all the values. It turns out that part is somewhat complicated to do. Luckily the famous dplyr by Hadley comes to the rescue with functions like sql() and escape(). There isn't a good vignette out there for them; the best resource is probably the ?sql help inside R.<br />
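A quick illustration of what these helpers do (assuming the dplyr version current at the time of writing; in later releases they moved to the dbplyr package). escape() quotes character values and doubles any embedded single quotes, while numbers pass through unquoted:

```r
library(dplyr)

escape("It's a test")  # character value: quoted, embedded single quote doubled
escape(2.99)           # numeric value: passed through unquoted
sql("now()")           # marks a string as literal SQL, exempt from escaping
```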
<br />
The goal is to escape everything that is not static text. Let's say we want to insert multiple rows into the table v_tmp, into columns named key_row, value1, value2 and added_time. Here's the code that does it - every value is escaped and put through the sql() function. The output looks properly weird, but when R sends it to the database, the correct things end up in the tables.<br />
<br />
<code>
df <- data.frame(key_row = 1:5, value1 = paste0("It's '", letters[1:5], "'!"), value2 = rnorm(5))<br />
# key_row value1 value2<br />
# 1 1 It's 'a'! 2.74812502<br />
# 2 2 It's 'b'! -0.06665964<br />
# 3 3 It's 'c'! -0.40579730<br />
# 4 4 It's 'd'! -0.41636723<br />
# 5 5 It's 'e'! 0.56018515<br />
<br />
# start off the insert clause<br />
v.insert <- "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES "<br />
# pre-make the vector for pieces<br />
v.pieces <- character(nrow(df))<br />
# cycle through the data (in data.frame called df)<br />
for (i in 1:nrow(df)) {<br />
v.pieces[i] <- sql(paste0(<br />
'(', escape(df$key_row[i]), ',<br />
', escape(df$value1[i]), ',<br />
', escape(df$value2[i]), ', now())'<br />
))<br />
}<br />
# put it together<br />
v.insert <- paste0(v.insert, paste0(v.pieces, collapse=","))<br />
# v.insert result:<br />
# [1] "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES (1,\n 'It''s ''a''!',\n 2.74812501918287, now()),(2,\n 'It''s ''b''!',\n -0.0666596436163713, now()),(3,\n 'It''s ''c''!',\n -0.405797295384545, now()),(4,\n 'It''s ''d''!',\n -0.416367233404705, now()),(5,\n 'It''s ''e''!',\n 0.560185149545129, now())"<br />
# and into the database it goes<br />
dbSendQuery(con, v.insert)<br />
</code><br />
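One side note, which is general DBI practice rather than anything from the RPostgreSQL docs in particular: dbSendQuery() returns a result object, and it is good hygiene to clear it once the statement has run, before reusing the connection:

```r
res <- dbSendQuery(con, v.insert)
dbClearResult(res)  # release the server-side result before the next query
```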
Function to easily draw a scatterplot with polynomial regression lines (2015-02-20)<br />
<br />
<div class="MsoNormal" style="text-indent: 0in;">
When exploring one's data (e.g., for subsequent modeling), it is often useful to fit polynomial regression lines of different orders and compare how well they fit.<o:p></o:p></div>
<br />
<div class="MsoNormal" style="text-indent: 0in;">
Although R is very good for plotting, adding nonlinear regression lines to a plot is a bit tedious. Here's a simple function, 'polyreglines', that plots a scatterplot of x and y and adds polynomial regression lines up to a specified order. It also adds a legend with the adjusted R-squared values for the models. When the argument "all" is set to FALSE, only one regression line of the specified order is drawn.<o:p></o:p></div>
<div class="MsoNormal" style="text-indent: 0in;">
<span style="text-indent: 0in;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;">polyreglines <- function(x.txt, y.txt, data, order=3, all=T, xlab, ylab, leg.pos, ...) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> x <- with(data, get(x.txt))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> y <- with(data, get(y.txt))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if(missing(xlab)) xlab <- x.txt</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if(missing(ylab)) ylab <- y.txt</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if(all==T) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> plot(x, y, xlab=xlab, ylab=ylab , ...)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> R2s <- numeric(length=order)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> for(i in 1:order) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> fit <- lm(y~poly(x,i), data=data)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> R2s[i] <- round(summary(fit)$adj.r.squared,2)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> x1 <- seq(from = min(x), to = max(x), length.out = 1000)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> g <- data.frame(x = x1)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> lines(x1, predict(fit, g), lty=i)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if(missing(leg.pos)) leg.pos = "topright"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> legend(leg.pos,legend=R2s,lty=c(1:order),title=expression(Adjusted ~ R^2))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> else {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> plot(x, y, xlab=xlab, ylab=ylab , ...)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> fit <- lm(y~poly(x,order), data=data)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> R2s <- round(summary(fit)$adj.r.squared,2)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> x1 <- seq(from = min(x), to = max(x), length.out = 1000)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> g <- data.frame(x = x1)</span><br />
<span style="font-family: Courier New, Courier, monospace;"> lines(x1, predict(fit, g))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> if(missing(leg.pos)) leg.pos = "topright"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> legend(leg.pos,legend=R2s,lty=1,title=expression(Adjusted ~ R^2))</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: 'Courier New', Courier, monospace; text-indent: 0in;">}</span><br />
<span style="text-indent: 0in;"><br /></span>
<span style="text-indent: 0in;">Note
that x and y are column names and so have to be within quotation marks. (It is also
possible to pass further arguments to plot and also specify the legend position.)</span></div>
<div class="MsoNormal" style="text-indent: 0in;">
<br /></div>
<div class="MsoNormal" style="text-indent: 0in;">
Example:<o:p></o:p></div>
<div class="MsoNormal" style="text-indent: 0in;">
<span style="font-family: Courier New, Courier, monospace;">polyreglines("mpg", "hp", mtcars, 2)</span><o:p></o:p></div>
<div class="MsoNormal" style="text-indent: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBOtj7kXLh-dj6P8JU_ohkAf7YX9zuA7zgOxxF32s2qFPafti_IsLhWm0ma95ShTTRZhcpH2gVroBTlBIc_4ht7ENzY3TTo8n5v1K2XTszFAxOBUspldJ2LL94sQ_iEjEL1jBBPCFt1RM/s1600/polyreglines1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBOtj7kXLh-dj6P8JU_ohkAf7YX9zuA7zgOxxF32s2qFPafti_IsLhWm0ma95ShTTRZhcpH2gVroBTlBIc_4ht7ENzY3TTo8n5v1K2XTszFAxOBUspldJ2LL94sQ_iEjEL1jBBPCFt1RM/s1600/polyreglines1.png" height="228" width="320" /></a></div>
<div class="MsoNormal" style="text-indent: 0in;">
<br /></div>
<div class="MsoNormal" style="text-indent: 0in;">
<span style="font-family: Courier New, Courier, monospace;">polyreglines("mpg", "hp", mtcars, 2, all=F, leg.pos="bottomleft", main="Example")</span><o:p></o:p></div>
<div class="MsoNormal" style="text-indent: 0in;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDFmCTRNJY5vsGsj05D5smi0TqAkn1KvkucTwDNzQcOpMLlwNOXl2LkcwI_6xUOlV1BhyphenhyphenEPMdxTKCqNaOsjRPPv0ZXJly9zjVpsPAi3gALqE39CXUdtJj62uBIija-C6DUmTjm4o3XDbU/s1600/polyreglines2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDFmCTRNJY5vsGsj05D5smi0TqAkn1KvkucTwDNzQcOpMLlwNOXl2LkcwI_6xUOlV1BhyphenhyphenEPMdxTKCqNaOsjRPPv0ZXJly9zjVpsPAi3gALqE39CXUdtJj62uBIija-C6DUmTjm4o3XDbU/s1600/polyreglines2.png" height="228" width="320" /></a></div>
<div class="MsoNormal" style="text-indent: 0in;">
<br /></div>
<div class="MsoNormal" style="text-indent: 0in;">
<span style="text-indent: 0in;">If you find this function useful, you can save the code in a text file (e.g. as "polyreglines.R") in your working directory and load it using source() (e.g. <span style="font-family: Courier New, Courier, monospace;">source("polyreglines.R")</span>).</span></div>
Waking up the blog again! (2014-09-17)<br />
<br />
It's been a good long while since my last post, and it feels like time to pick it up again much more regularly. Someone said that one should think about the balance between how much they consume and how much they produce - the author was talking about intellectual work. So here I am, reading massive amounts daily but writing nothing down. That's about to change, since I also know that expressing thoughts verbally is good for retention.<br />
<br />
Let's start with an awesome post I read this morning - <a href="https://moderncrypto.org/mail-archive/messaging/2014/000780.html" target="_blank">Modern anti-spam and E2E crypto</a>. It talks at length about how Google and other email service providers have waged war on spammers. The author also links to a Google <a href="http://googleblog.blogspot.com/2013/02/an-update-on-our-war-against-account.html" target="_blank">blog post</a> about how they decide whether you should be asked for additional verification on a login attempt:<br />
<blockquote class="tr_bq">
<span style="color: #444444; font-family: arial, sans-serif;"><span style="font-size: 12.8000001907349px; line-height: 18.2000007629395px;">In fact, there are more than 120 variables that can factor into how a decision is made.</span></span></blockquote>
Talk about feature engineering :)<br />
<br />
But how is all that connected to data science? Well, it's all classification problems, coupled mostly with reinforcement learning. Most of what the author writes about has probably never seen the light of day - all proprietary magic. Fascinating! :)<br />
<br />
Anyway, my evening watch list has been this - <a href="http://techtalks.tv/icml2014/" target="_blank">videos from ICML2014</a>. Enjoy.<br />
<br />
Taking job interviews is good (2014-08-01)<br />
<br />
I had a real job interview yesterday for a data scientist position at a (very well funded) startup. It was a short lunch meeting with the objective of "getting to know you". It was awkward and uncomfortable, and I thought long and hard about writing this blog post.<br />
<br />
So here's a list of things that were "off" and what I learned from it:<br />
<br />
<ul>
<li>a meeting needs an objective that is predefined or at least specified at the start of the meeting. I learned that if there isn't one, I need to ask the other party to define one. It's like a friend of mine just wrote: "If you don't know the question how will you know if/when you get the answer?"</li>
<li>meeting for lunch is OK, but you need a place with limited noise. I learned that if I either organize or agree to a meeting in public space, I need to scout it out beforehand or be vocal about being bothered by the noise.</li>
<li>when you meet for lunch, organize the questions-answers so that everyone has a chance to enjoy or at least finish their meal. Once again it was a new experience for me and I'll verbalize this the next time.</li>
<li>have the courage to say "I don't think it's a good fit". Leaving with a "we might get back to you before or after vacations, or something, maybe" is just insulting. In the future I will make a concluding remark myself if the other side is avoidant.</li>
</ul>
<div>
Interviewing for a position is a skill, on both sides of the table. The only way to get better is to practice. Yesterday's experience was a good lesson, and I'll continue the journey :)</div>
Jobs in data science or data analytics (2014-06-18)<br />
<br />
<div dir="ltr">
I kept putting off writing this blog post hoping to get some more info or that something would happen, but neither did. The job market seems bleak.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
I've been checking the local (Estonian) job offers for a few months now, looking for signs of data science/analytics jobs. There are some, but the wording is odd, and it shows that these companies are hiring their first "data people".</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Out of interest I went and applied to a few of these positions. One company actually knew a little something about modeling and they were looking for someone to work on classification. Others were completely random. Most didn't reply and the few that did expressed that they didn't really know what they were looking for. Reminded me a lot of the "<a href="http://www.jasonbock.net/jb/News/Item/7c334037d1a9437d9fa6506e2f35eaac" target="_blank">if carpenters were hired like programmers</a>" joke.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
Anyways, so that's the current status of data science jobs in Estonia. Which suits me fine since I'm just an enthusiast.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
The Machine Learning course by Andrew Ng started yesterday on Coursera. It's supposedly very good, but I haven't had the time to check it out yet. I'm still finishing up Practical Machine Learning by Jeff Leek, and it's pretty good. The materials are understandable and the quizzes are mostly clear, with multiple-choice answers, so there are no problems with number formats and the like.</div>
<div dir="ltr">
<br /></div>
<div dir="ltr">
There was a first Meetup of a group called Startup Founders 101, run by the Estonian Development Fund (Arengufond). For the first occasion the topic was "<a href="http://www.meetup.com/Tallinn-Startup-Founder-101/events/188212752/" target="_blank">From Employee to Entrepreneur</a>". It got me thinking that I could probably start doing some consulting in data science in a few months. Real practice with real problems! Still just an idea, though... we'll see what happens :)</div>
To MOOC or not to MOOC? (2014-06-04)<br />
<br />
MOOCs are Massive Open Online Courses, of course. So the question is: should one use this resource on the way to becoming a data scientist? The focus of this post is mostly on machine learning (ML), as it's the newest and messiest of the areas that make up data science (remember the <a href="http://en.wikipedia.org/wiki/File:Data_Science_Venn_Diagram.png" target="_blank">Venn diagram</a>). It's also the area that I'm actively learning and experimenting in.<br />
<br />
MOOCs are a new thing. They really got started with <a href="http://blog.udacity.com/2012/05/udacians-welcome-to-udacitys-own-blog.html" target="_blank">AI Class</a> by Sebastian Thrun and Peter Norvig, which led to the founding of <a href="http://www.udacity.com/" target="_blank">Udacity</a>. I took the first and only offering of that course back in the fall of 2011 and enjoyed it very much. My Sunday mornings, and later days and evenings, were spent watching the videos and working the problems. Fun times :)<br />
<br />
A few more players have entered the MOOC market since then and all of the major players offer courses on machine learning (only listing those that are upcoming or have materials available):<br />
<br />
<ul>
<li>Udacity - Machine Learning <a href="https://www.udacity.com/course/ud675" target="_blank">1</a>, <a href="https://www.udacity.com/course/ud741" target="_blank">2</a>, <a href="https://www.udacity.com/course/ud820" target="_blank">3</a> and <a href="https://www.udacity.com/course/ud359" target="_blank">Intro to Data Science</a></li>
<li>Coursera - <a href="https://www.coursera.org/course/pgm" target="_blank">Probabilistic Graphical Models</a>, <a href="https://class.coursera.org/nlp/lecture/preview" target="_blank">Natural Language Processing</a>, <a href="https://www.coursera.org/course/ml" target="_blank">Machine Learning</a> (starting soon), and a whole <a href="https://www.coursera.org/specialization/jhudatascience/1" target="_blank">Data Science specialization</a> (9 courses)</li>
<li>EdX - <a href="https://www.edx.org/course/bux/bux-sabr101x-sabermetrics-101-1558#.U444KvnJB8E" target="_blank">Sabermetrics 101</a></li>
<li>Kaggle - practice / <a href="http://www.kaggle.com/competitions/search?RewardColumnSort=Ascending" target="_blank">introductory competitions</a> for first steps in machine learning</li>
<li>Caltech - <a href="http://work.caltech.edu/telecourse.html" target="_blank">Learning From Data</a></li>
</ul>
<div>
And so on... If you need a refresher on some concepts then Khan Academy is a great resource on pretty much everything. In addition to math they cover stuff from <a href="https://www.khanacademy.org/math/probability" target="_blank">probability through to confidence intervals</a>.</div>
<div>
<br /></div>
<div>
Want to talk to others about machine learning and ask questions? Sure, all of the courses have message boards, but that's course-specific. Reddit has an active ML-specific <a href="http://www.reddit.com/r/MachineLearning/" target="_blank">subreddit</a> with over 24K subscribers. Forums at the MOOCs are actually good as well - you will see approaches and explanations that you would never come up with yourself!</div>
<div>
<br /></div>
<div>
Now that we've established that there's a great many ways to actively learn machine learning and data science - should you? Well, the online courses are of variable length and quality and I've only taken a few of them so far. The original AI Class is no longer available, so that's that. The Coursera specialization has some interesting courses, but the one on statistical inference was a letdown for me. It was a birds-eye view of probability, hypothesis testing, some Bayesian inference and power calculations. If you know most of that stuff and want a refresher then it's a good course to take, especially since all the quizzes are available from day 1. I only knew about half, so that was OK, but the other half went over my head and would've required time I didn't have to really dig into it with help from other sources.</div>
<div>
<br /></div>
<div>
I'm taking Regression Models and Practical Machine Learning on Coursera right now, as well as Machine Learning, which starts in a couple of weeks. While I have some experience with natural language processing and topic modeling, these should help me get a better understanding of a few more areas.</div>
<div>
<br /></div>
<div>
So yeah, go on and check out these courses, identify your skill level and start learning! Take the free courses first :)</div>
<div>
<br /></div>
<div>
Next post should be about jobs in data science.</div>
The beginning (2014-06-02)<br />
<br />
I've been following the Big Data and Data Science hype for a few years now. So far it's been something that's done elsewhere, mostly at top companies like Google, Microsoft and Amazon, and at top universities like MIT, Stanford and Cambridge. It's pretty intimidating trying to be a data science enthusiast from Estonia :)<br />
<br />
Just this morning I found (the original) <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">data science Venn diagram</a>, which puts things into perspective for me. I've been trying to figure out how a classification expert (a subset of machine learning) could be useful in psychology if he or she lacks the knowledge to make sense of the results. I just completed my Master's-level studies in psychology, and it's apparent that this is not something a machine learning expert can just jump into without extensive study or a very close collaboration effort. The Venn diagram sums it up pretty well - you need domain knowledge, math & stats skills, and the ability to work with data and algorithms to get to data science.<br />
<br />
Why would one aim for data science? Well, as the <a href="http://en.wikipedia.org/wiki/Data_science">Wikipedia article</a> explains, it's the "extraction of knowledge from data". In my studies I've seen up close the process of gathering data on how children study - they fill out tests and take part in experiments, and the same is asked of their teachers and sometimes parents. All of it takes years of effort by teams of people and results in a large amount of data, which is then subjected to a tiny sample of traditional research: "let's see if these two things correlate". It does result in articles, but mostly ones that confirm ideas previously held. This is actually well and good, as the people here are truly masters of their trade in the realms of developmental and educational psychology. On the other hand, it feels like the work of all the children and researchers deserves another chance in the hands of a data scientist - or rather, similar work to be done in the future, as the current data sets are of a quality to take even the best data wrangler to an early grave.<br />
<br />
To sum up: I have some math and stats background, have been writing software for more than 20 years, did some sentiment analysis at Cambridge University last summer, and consider myself a data science enthusiast. Future posts will explore the MOOCs currently available in this direction, give an overview of job offers in data science, and share other musings on this topic.