Wednesday, March 18, 2015

Comparison of packages RPostgreSQL and RPostgres, vol 1 - connecting

Analysing data often includes that data being stored in a database. In my case it's stored in a PostgreSQL database. My data flow is pull data into R -> analyse it, compute some new features -> return data into database for other people and tools. I could do it via files, but apparently pulling data directly from Postgre into R is really fast.

The first thing though is connecting to said database. There's a choice between two packages, RPostgreSQL and RPostgres.The former is by Dirk Eddelbuettel and the latter is by Hadley Wickham. Very well known names in the R world and they were both also added as members to R Foundation in 2014.

Setting up the connection with RPostgreSQL:

library(RPostgreSQL)

dbhost <- '127.0.0.1'
dbport <- '5432'
dbuser <- 'lauri_koobas'
dbname <- 'my_db'

## loads the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

## Open a connection
con <- dbConnect(drv, host=dbhost, port=dbport, dbname=dbname, user=dbuser, password=scan(what=character(),nmax=1,quiet=TRUE))

The password part works in such a way as to prompt you for it in console window. I usually follow it up by hitting CTRL-L to clear the console as well. No need to save or show your password anywhere for too long.

Setting the connection up with RPostgres is pretty much the same, except you use RPostgres::Postgres() instead of drv as parameter for dbConnect.

But what if you need to run your script automatically? Or have it deployed to a production server where you don't have a way to go and type in your password. Writing passwords directly into code is a very bad idea as the code will end up in a repository and you really don't want it to end up in Github.

Well, there's more than one way to approach this problem as well. You could write the password to a configuration file that is not stored in a repository. Or you could put the whole connection string to a configuration file, so your script only loads it and connects.

But PostgreSQL offers another solution - the .pgpass file. It works a bit diffently in Windows and Unix, but essentially it's a type of configuration file that is especially meant for this exact problem. It's of the following format and you can use wildcards in it:

hostname:port:database:username:password

Using it is a bit tricky though and took me a while to figure out. What you need to do is provide everything BUT the password and then it will go and find the password in that .pgpass file. An example using RPostgres package.

.pgpass file:
*:*:my_db:lauri_koobas:password

R script:
con <- dbConnect(RPostgres::Postgres(), host="127.0.0.1", port="5432", dbname="my_db", user="lauri_koobas")

That's it. Keep your code versionized and your passwords safe!

No comments:

Post a Comment