Vinayak Machan • about 5 years ago

### Unable to transform bigr.frame to bigr.matrix using bigr.transform

Thanks to Bluemix support, I finally was able to download the updated bigr package which has the features to actually create linear models using bigr.

These model functions though need a bigr.matrix and do not accept a bigr.frame. So I tried using the bigr.transform function that transforms a bigr.frame to a bigr.matrix. This ran for about 10 minutes (the data frame has about 2 million records) and then errored out. I have listed below the steps I am doing and the corresponding error. The error by itself does not help much in narrowing down the cause.

Has anyone been able to successfully do this? Would appreciate some help.

Code snippet below:

    bigr.connect(host = "bi-hadoop-prod-XXX.services.dal.bluemix.net",
                 port = 7052, database = "default",
                 user = "user", password = "password")

    is.bigr.connected()

    bdf <- bigr.frame(dataPath = "/user/data/part-00000",
                      dataSource = "DEL",
                      delimiter = ",",
                      header = TRUE)

    bmtrx <- bigr.transform(bdf,
                            outData = "/user/data/matrix/outdata",
                            transformPath = "/user/data/matrix/transform")

    modLM <- bigr.lm(Y ~ X1 + X2 + X3 + X4, data = bmtrx)  # This statement throws the error below

ERROR LISTED BELOW

Error: BigR[.bigr.jdbc.query.helper]: Error code : 0, SQLState : 08003

Caused by :

The stack trace :java.sql.SQLException

at com.ibm.biginsights.bigsql.jdbc.BigSQLStatement.executeQuery(BigSQLStatement.java:94)

Look forward to some responses.

Regards

Vinayak


## 5 comments

Weiqiang Zhuang • about 5 years ago

Hi Vinayak,

So far, the ML APIs support only formulas of the form Y ~ . (note the dot), which means Y against all the other features in the matrix.

If your data has more columns/features and you only want X1, X2, X3 and X4 in the linear regression model, you can take this alternative route:

1) create the bigr.frame bdf as you have done

2) project just the columns of interest:

bdf_p <- bdf[, c("X1", "X2", "X3", "X4", "Y")]

3) run the transform on bdf_p, as you did for bdf

4) build the model:

modLM <- bigr.lm(Y ~ ., data=bmtrx, directory="/user/data/modLM")
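To illustrate why projecting the columns first and then using `Y ~ .` gives the same model as listing the predictors explicitly, here is a minimal local sketch. It uses only base R (`data.frame` and `lm`) on simulated data and makes no Big R calls; the extra column `X5` and all the numbers are made up for the example.

```r
# Local base-R analogue of steps 1-4 (no Big R calls; data is simulated).
set.seed(42)
n <- 200
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n),
                 X4 = rnorm(n), X5 = rnorm(n))  # X5 is an unwanted extra column
df$Y <- 1 + 2 * df$X1 - df$X2 + 0.5 * df$X3 + rnorm(n)

# Project just the columns of interest; column names are quoted strings.
df_p <- df[, c("X1", "X2", "X3", "X4", "Y")]

# "Y ~ ." on the projected data uses every remaining column as a predictor.
fit_dot <- lm(Y ~ ., data = df_p)

# The explicit formula on the full data selects the same predictors.
fit_explicit <- lm(Y ~ X1 + X2 + X3 + X4, data = df)

# Both fits should have identical coefficients.
same_fit <- isTRUE(all.equal(coef(fit_dot), coef(fit_explicit)))
print(same_fit)
```

With the projection in place, the dot formula cannot pick up X5, which is the point of step 2.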

Let me know if you run into any issues with the above.

Thanks.

Adrian

Vinayak Machan • about 5 years ago

Hi Adrian

Thanks for this update. I tried the steps you listed and am able to successfully derive a linear model using the bigr.lm function. Many thanks for your detailed answer.

Is there a way to plot this model? The regular plot function does not work. Also, I see help for functions such as bigr.boxplot, but when I try to run them I get an error such as: Error: could not find function "bigr.boxplot".

Thanks again for the answer to my earlier question.

Regards

- Vinayak

Oscar Lara-Yejas • about 5 years ago

Hi, Vinayak.

R does allow you to plot the residuals and a Q-Q plot, among others, to get a feel for the goodness of fit of the regression. In the context of big data, plotting millions or even billions of data points is not only technically challenging but may also be unnecessary.

Big R does not provide model visualization out of the box. If you want to analyze the quality of the predictions generated by a linear regression model with Big R, the predict(bigr.lm,...) method gives you some statistics, including R squared and some other metrics (see below).

    $statistics
       Name               Y-column Scaled         Value
     1 LOGLHOOD_Z               NA  FALSE           NaN
     2 LOGLHOOD_Z_PVAL          NA  FALSE           NaN
     3 PEARSON_X2               NA  FALSE  2.254519e+06
     4 PEARSON_X2_BY_DF         NA  FALSE  5.486783e+02
     5 PEARSON_X2_PVAL          NA  FALSE  0.000000e+00
     6 DEVIANCE_G2              NA  FALSE  2.254519e+06
     7 DEVIANCE_G2_BY_DF        NA  FALSE  5.486783e+02
     8 DEVIANCE_G2_PVAL         NA  FALSE  0.000000e+00
     9 LOGLHOOD_Z               NA   TRUE           NaN
    10 LOGLHOOD_Z_PVAL          NA   TRUE           NaN
    11 PEARSON_X2               NA   TRUE  2.254519e+06
    12 PEARSON_X2_BY_DF         NA   TRUE  5.486783e+02
    13 PEARSON_X2_PVAL          NA   TRUE  0.000000e+00
    14 DEVIANCE_G2              NA   TRUE  2.254519e+06
    15 DEVIANCE_G2_BY_DF        NA   TRUE  5.486783e+02
    16 DEVIANCE_G2_PVAL         NA   TRUE  0.000000e+00
    17 AVG_TOT_Y                 1     NA  7.033779e+00
    18 STDEV_TOT_Y               1     NA  3.066331e+01
    19 AVG_RES_Y                 1     NA -6.203834e-01
    20 STDEV_RES_Y               1     NA  2.341850e+01
    21 PRED_STDEV_RES            1   TRUE  1.000000e+00
    22 PLAIN_R2                  1     NA  4.171571e-01
    23 ADJUSTED_R2               1     NA  4.164479e-01
    24 PLAIN_R2_NOBIAS           1     NA  4.175666e-01
    25 ADJUSTED_R2_NOBIAS        1     NA  4.167159e-01
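For readers unfamiliar with the R-squared rows, the sketch below computes plain and adjusted R squared by hand from residuals, using the textbook definitions and base R only. This is an illustration on simulated data; it is an assumption that PLAIN_R2 and ADJUSTED_R2 follow these same definitions, and Big R's exact formulas may differ in detail.

```r
# Textbook R-squared from residuals, in base R (simulated data, no Big R calls).
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)

fit  <- lm(y ~ x)
pred <- fitted(fit)

ss_res <- sum((y - pred)^2)      # residual sum of squares
ss_tot <- sum((y - mean(y))^2)   # total sum of squares around the mean

r2 <- 1 - ss_res / ss_tot        # plain R-squared

p <- 1                           # number of predictors (excluding the intercept)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R-squared

print(c(r2 = r2, adj_r2 = adj_r2))
```

Both values can be checked against `summary(fit)$r.squared` and `summary(fit)$adj.r.squared`.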

Now, if you really want to compute the residuals, you can do so using Big R's arithmetic operators. Big R allows you to subtract two columns in a bigr.matrix in CSV format like this:

residuals <- bm$real_value - bm$predicted_value

To do that, you just need to append the column with the predictions to the testing set. You can do so by using the bigr.cbind() function:

pred <- predict(lm, test_set....)

bm <- bigr.cbind(test_set, pred$predictions, "[PATH_ON_HDFS]")

Again, once you have bm with both the predictions and the ground truths, you can subtract them as described above.

Best,

Oscar

Oscar Lara-Yejas • about 5 years ago

And, once you have the residuals (you will notice it is a bigr.vector object), you can turn it into a vector using as.vector(residuals), and then you can plot it as you would plot any R object.
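Once the residuals are an ordinary numeric vector, the usual base R diagnostics apply. The sketch below is a local stand-in with simulated values (no Big R calls); the names `real_value`, `predicted_value` and `res`, and the sample size of 5000, are assumptions for the example. Plotting a random sample rather than every point is one way to keep the plot manageable on large data.

```r
# Residual diagnostics on a plain numeric vector (simulated stand-in data).
set.seed(7)
n <- 1e5
real_value      <- rnorm(n, mean = 7, sd = 30)
predicted_value <- real_value + rnorm(n, sd = 20)  # stand-in for model predictions

res <- real_value - predicted_value                # residuals as a numeric vector

# Plot a random sample instead of all n points.
idx <- sample.int(n, 5000)

# Residuals vs. predictions: look for structure around the zero line.
plot(predicted_value[idx], res[idx],
     xlab = "Predicted", ylab = "Residual", pch = ".")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the sampled residuals.
qqnorm(res[idx])
qqline(res[idx])
```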

Best,

Oscar

Vinayak Machan • about 5 years ago

Thanks Oscar, I will try and use the approach you mentioned.

Regards

Vinayak