Monday, February 23, 2015

How to manage a large amount of data with (Postgre)SQL in R

In my R projects I've used data stored in a MySQL database before, and it's straightforward: you send a query, you receive the result, and that's it. That works well as long as the dataset is small and you can write the queries by hand, inserting or updating only tens or hundreds of rows at a time. Simply looping through the data and running the queries one by one is the quick (to write) and easy solution.

Recently I've had a project where the backend is PostgreSQL and the amount of data to pull and push is a few hundred thousand rows at a time. Querying is still fine, but doing updates in a loop is definitely not. Even at one second per insert, it would take days to push the results back into the database. I could dump the results into a file and COPY that into the database, and that's probably the only solution once the number of rows gets into the millions, but there is another way for intermediate data sizes (roughly 1,000 to 1 million rows).
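As an aside, for the file-and-COPY route you don't even need to leave R: RPostgreSQL's dbWriteTable() can bulk-load a whole data.frame in one call (it uses COPY internally, as far as I know). A minimal sketch, where the connection details and table name are placeholders:

```r
library(RPostgreSQL)

# placeholder connection details - adjust for your own database
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", host = "localhost",
                 user = "user", password = "password")

df <- data.frame(key_row = 1:5, value1 = letters[1:5], value2 = rnorm(5))

# append the whole data.frame in one call;
# row.names = FALSE keeps R's row names out of the table
dbWriteTable(con, "v_tmp", df, append = TRUE, row.names = FALSE)

dbDisconnect(con)
```

The catch is that dbWriteTable gives you less control over the statement itself - no expressions like now() in the values - which is where the approach below comes in.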

The package used is RPostgreSQL (with examples at Google Code). What makes inserting a large number of rows at once fast is the multiple-row insert format, which is only briefly mentioned at the bottom of the PostgreSQL manual page on INSERT. The idea is that you simply list the value tuples one after the other:

INSERT INTO products (product_no, name, price) VALUES
    (1, 'Cheese', 9.99),
    (2, 'Bread', 1.99),
    (3, 'Milk', 2.99);

Sounds simple enough. But I hope you want to be safe in your SQL queries, so you escape all the values. That part turns out to be somewhat complicated to do by hand. Luckily, Hadley Wickham's well-known dplyr package comes to the rescue with functions like sql() and escape(). There isn't a good vignette for them; the best documentation is probably the ?sql help page inside R.
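To get a feel for what these two functions do before using them in earnest, here is a small illustration (a sketch; the exact printed form can vary between dplyr versions):

```r
library(dplyr)

# character values get quoted, and any single quotes inside
# the string are doubled, so they are safe inside a statement
escape("It's 'a'!")
# something like: <SQL> 'It''s ''a''!'

# numeric values pass through unquoted
escape(2.748)

# sql() marks a string as already-valid SQL,
# so it will not be escaped or quoted again
sql("now()")
```

In short: escape() makes a value safe, and sql() tells dplyr "trust me, this piece is SQL, leave it alone".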

The goal is to escape everything that is not static text. Let's say we want to insert multiple rows into the table v_tmp, into the columns key_row, value1, value2 and added_time. Here's the code that does it: every value is escaped and the result is wrapped in the sql() function. The output looks properly weird, but when R sends it to the database, the correct things end up in the tables.

df <- data.frame(key_row = 1:5, value1 = paste0("It's '", letters[1:5], "'!"), value2 = rnorm(5))
# key_row    value1      value2
# 1       1 It's 'a'!  2.74812502
# 2       2 It's 'b'! -0.06665964
# 3       3 It's 'c'! -0.40579730
# 4       4 It's 'd'! -0.41636723
# 5       5 It's 'e'!  0.56018515

# start off the insert clause
v.insert <- "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES "
# pre-make the vector for pieces
v.pieces <- character(nrow(df))
# cycle through the data (in data.frame called df)
for (i in seq_len(nrow(df))) {
  v.pieces[i] <- sql(paste0(
    '(', escape(df$key_row[i]), ',
    ', escape(df$value1[i]), ',
    ', escape(df$value2[i]), ', now())'
  ))
}
# put it together
v.insert <- paste0(v.insert, paste0(v.pieces, collapse=","))
# v.insert result:
# [1] "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES (1,\n    'It''s ''a''!',\n    2.74812501918287, now()),(2,\n    'It''s ''b''!',\n    -0.0666596436163713, now()),(3,\n    'It''s ''c''!',\n    -0.405797295384545, now()),(4,\n    'It''s ''d''!',\n    -0.416367233404705, now()),(5,\n    'It''s ''e''!',\n    0.560185149545129, now())"
# and into the database it goes
dbSendQuery(con, v.insert)
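The loop can also be written without an explicit for; here is a sketch of the same escaping done with vapply(), and the insert wrapped in a transaction so it either fully succeeds or leaves the table untouched (assuming the same con and df as above):

```r
# build all value tuples at once instead of filling a vector in a loop
v.pieces <- vapply(seq_len(nrow(df)), function(i) {
  paste0("(", escape(df$key_row[i]), ", ",
         escape(df$value1[i]), ", ",
         escape(df$value2[i]), ", now())")
}, character(1))

v.insert <- paste0(
  "INSERT INTO v_tmp(key_row, value1, value2, added_time) VALUES ",
  paste0(v.pieces, collapse = ",")
)

# run the statement inside a transaction: if the insert fails
# halfway, nothing is committed
dbSendQuery(con, "BEGIN")
dbSendQuery(con, v.insert)
dbSendQuery(con, "COMMIT")
```

The transaction wrapper is optional for a single statement, but it becomes useful once you split very large inserts into several chunks.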
