Rob Campanell • over 11 years ago

Big R

I am having an issue with converting a Big R frame to an R data frame. Does anyone know what is going on here? I am using the Big R documentation found here.

http://www-01.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.bigr.doc/doc/00Index.html

> outflow0405 <- bigr.frame(dataSource="DEL",dataPath="/tmp/hive-hive/irs/countyinflow0405.csv",delimiter=',',header=TRUE,
+ coltypes = ifelse(1:9 %in% c(5,6),"character", "integer"),useMapReduce=TRUE)

> str(outflow0405)
'bigr.frame': 9 variables:
$ State_Code_Dest : int 0 0 0 0 0 1
$ County_Code_Dest : int 0 0 0 0 0 0
$ State_Code_Origin : int 96 97 97 97 98 96
$ County_Code_Origin: int 0 0 1 3 0 0
$ State_Abbrv : chr "US" "US" "US" "US" "US" "AL"
$ County_Name : chr "Total Mig - US & For" "Total Mig - US" "Total Mig - US Same St" "Total Mig - US Diff St" "Total Mig - Foreign" "Total Mig - US & For"
$ Return_Num : int 7238512 7029750 4023108 3006642 208762 96365
$ Exmpt_Num : int 13719599 13400949 7656393 5744556 318650 199539
$ Aggr_AGI : int 314723861 309552240 167622376 141929864 5171621 3643250

> countyoutflow0405 <- as.data.frame(outflow0405)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

17 comments
Oscar Lara-Yejas • over 11 years ago

Hello.

Could you point me to the dataset to give it a try? My initial guesses are that either (1) some rows have missing delimiters, or (2) some string values were found in an integer/numeric column.

Thanks,
Oscar
Rob Campanell • over 11 years ago

Here's where I connected

bigr.connect(host="bi-hadoop-prod-166.services.dal.bluemix.net", port=7052, user="biblumix",)

I used the password found on my Bluemix IBM Analytics for Hadoop welcome screen under connect and manage files.

The command used to access the data is listed in my previous post with bigr.frame command.

I am using R Studio.
Oscar Lara-Yejas • over 11 years ago

Hi, Rob.

bigr.frame objects support many of the primitives/operations/functions you can apply to data.frames, such as subsetting, arithmetics, summaries, sort, merge, sampling, histograms, aggregate functions, etc., using the exact same or a very similar syntax. Therefore, you probably don't even need to turn the bigr.frame into a data.frame. Also, data.frame is constrained by the main memory size, while bigr.frames can take arbitrarily large datasets.

If you still want to do as.data.frame, could you try removing the coltypes parameter in the constructor of bigr.frame?

outflow0405 <- bigr.frame(dataSource="DEL",dataPath="/tmp/hive-hive/irs/countyinflow0405.csv",delimiter=',',header=TRUE,
useMapReduce=TRUE)

And then retry str() and as.data.frame()? Could you paste the output of both?
Rob Campanell • over 11 years ago

Hi Oscar,

I removed the coltypes parameter and here are the results:

> outflow0405 <- bigr.frame(dataSource="DEL",dataPath="/tmp/hive-hive/irs/countyinflow0405.csv",delimiter=',',header=TRUE,
+ useMapReduce=TRUE)

> str(outflow0405)
'bigr.frame': 9 variables:
$ State_Code_Dest : chr "00" "00" "00" "00" "00" "01"
$ County_Code_Dest : chr "000" "000" "000" "000" "000" "000"
$ State_Code_Origin : chr "96" "97" "97" "97" "98" "96"
$ County_Code_Origin: chr "000" "000" "001" "003" "000" "000"
$ State_Abbrv : chr "US" "US" "US" "US" "US" "AL"
$ County_Name : chr "Total Mig - US & For" "Total Mig - US" "Total Mig - US Same St" "Total Mig - US Diff St" "Total Mig - Foreign" "Total Mig - US & For"
$ Return_Num : chr "7238512" "7029750" "4023108" "3006642" "208762" "96365"
$ Exmpt_Num : chr "13719599" "13400949" "7656393" "5744556" "318650" "199539"
$ Aggr_AGI : chr "314723861" "309552240" "167622376" "141929864" "5171621" "3643250"

> countyoutflow0405 <- as.data.frame(outflow0405)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

Removing the coltypes parameter converts all the columns to character in the bigr.frame

My data is in separate files for each year. I need to add new columns to indicate the year and put them together into one dataset. Unfortunately, BigR does not have the rbind() function, so I need to convert them to data frames.
Oscar Lara-Yejas • over 11 years ago

How big is your data? I'm just afraid that if your data is too large, when you do rbind() to combine all individual files, R may run out of memory.

Even though we don't support rbind(), you can proceed as follows:

1. Create a list of bigr.frame's, one for each year.
2. You can add the column 'year' to each bigr.frame using Big R's $<- operator
3. Persist all bigr.frame's into the same HDFS folder using bigr.persist(). This will create new files on HDFS. Name the files (dataPath) using the consecutive year (the alphabetical order of the file names will determine the order of the data). Have header=false for all bigr.frame's but the first one.
4. Create a bigr.frame on the folder path you just created. Big R will pick up all files in the folder.

Still, it is odd that as.data.frame fails here, which makes me thing that the data could have some missing delimiters. Could you try something like summary(outflow0405) and paste the results?

I'm not sure I can access your host. Could you point me to the data?
Rob Campanell • over 11 years ago

Oscar, I give it a try. Thanks for the suggestion.

Here's my summary results

> str(outflow0405)

'bigr.frame': 9 variables:
$ State_Code_Dest : int 0 0 0 0 0 1
$ County_Code_Dest : int 0 0 0 0 0 0
$ State_Code_Origin : int 96 97 97 97 98 96
$ County_Code_Origin: int 0 0 1 3 0 0
$ State_Abbrv : chr "US" "US" "US" "US" "US" "AL"
$ County_Name : chr "Total Mig - US & For" "Total Mig - US" "Total Mig - US Same St" "Total Mig - US Diff St" "Total Mig - Foreign" "Total Mig - US & For"
$ Return_Num : int 7238512 7029750 4023108 3006642 208762 96365
$ Exmpt_Num : int 13719599 13400949 7656393 5744556 318650 199539
$ Aggr_AGI : int 314723861 309552240 167622376 141929864 5171621 3643250
> summary(outflow0405)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

> summary(outflow0405$State_Code_Dest)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

Here's the link to the data set I am using here.

http://www.irs.gov/file_source/pub/irs-soi/countyoutflow0405.csv
Rob Campanell • over 11 years ago

Oscar

Here's all my attempts at bigr.persist(). I'm still getting the same error message.

> bigr.persist(outflow0405, dataPath="/tmp/outflow", header=TRUE)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

> bigr.persist(outflow0405, dataSource="DEL", dataPath="/tmp/outflow", header=TRUE)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

> bigr.persist(outflow0405, dataSource="DEL", dataPath="/tmp/outflow/outflow0405.csv", delimiter=',', header=TRUE)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

> bigr.persist(outflow0405, dataSource="DEL", dataPath="/tmp/outflow/outflow0405.csv", delimiter=',', header=TRUE, useMapReduce=TRUE)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.
Herman Sahota • over 11 years ago

Hi Rob, there seems to be an issue with the format of your dataset. It looks like the dataset contains two extra blank rows at the bottom and BigR does not like that. I tried as.data.frame() on your bigr.frame after removing one extra row at the bottom of your dataset and I can see the data.frame then.
Hope that helps,
Herman
Vinayak Machan • over 11 years ago

Hi Rob,

Were you able to fix this problem? I am getting the same error when using the NYPD 311 curated data set, the link to the data is listed below
https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

The data set is pretty large so am unable to browse and detect empty lines on my PC.

Could let us know if you solved this issue and if so, how?

Regards
Vinayak
Herman Sahota • over 11 years ago

Hi Vinayak
Could you post the set of commands to reproduce the error that you are getting. E.g. do you try to create a data.frame out of bigr.frame like Rob is doing? Otherwise, it may be a good idea to start a new discussion topic.
Vinayak Machan • over 11 years ago

Hi Herman

Listed below is the code I run, the error I get is on the last line where I do a summary. I actually am able to create a bigr.frame but am unable to use it for a summary.

library(bigr)
bigr.connect(host="bi-hadoop-prod-345.services.dal.bluemix.net",
port=7052, database="default",
user="", password="")

is.bigr.connected()

nypd311 <- bigr.frame(dataPath = "/user/biblumix/vm_nypd_311_svc_data/311_Service_Requests_from_2010_to_Present.csv",
dataSource = "DEL",
delimiter=",", header = T)
names(nypd311)
str(nypd311)
head(nypd311)

# Replace spaces and parentheses in colnames;
# these give an error when trying to do a summary
names(nypd311) <- gsub(" ", "_", names(nypd311))
names(nypd311) <- gsub("\$", "", names(nypd311))
names(nypd311) <- gsub("\$", "", names(nypd311))

summary(nypd311)
Error: BigR[.bigr.executeJaqlQuery]: The number of columns specified in colnames/coltypes does not match the number of columns in the dataset.

Unsure if this needs a separate discussion topic.

Regards
Vinayak
Vinayak Machan • over 11 years ago

Hi Herman

I tried the above steps with a smaller data set (top 10000 records of the original file) and it works with that. So it does look like the original file may have some data issues that I need to fix first.

Do you know if we get command-line access (via ssh or putty) to these environments? It would be easier to identify the problem on Linux rather than on my Windows PC. I think this question would be a different topic though.

Thanks
Vinayak
Herman Sahota • over 11 years ago

Yes, it does look like there are some issues with the data formatting. (Usually there is a problem at the end of the file.)

I don't have an answer to your question about ssh access to the nodes but someone else might. Feel free to post a new topic for this one.
Rob Campanell • over 11 years ago

I'm still trying to figure out this issue. I match on row counts when I read the data in from my PC in R Studio and when I import the same data file into BigSheets.
Vinayak Machan • over 11 years ago

I found that the last record in my file was incomplete i.e. all the fields were not present. I had Git Bash on my windows PC, so was able to do a 'tail' on the file and identify this issue.

I removed this record and then re-ran the file thru the code I pasted earlier and this time it worked fine. I used the BigSheets filter feature to remove the record based on the unique key value that is present on all records.

@Herman - Thanks for your input.

@Rob - I think the row counts would still match, but when you try to do an operation where the columns need to be enumerated/accessed, this error is encountered since there is a mismatch between the number of columns in the metadata (e.g. colnames) and the actual number of columns in the data record.

Hope this helps.
Vinayak
Rob Campanell • over 11 years ago

Vinayak, Thanks for your help. I ran nrow on the bigr.frame and found one extra record. However when I tried to remove it, I am encountering another bigr error message. Any suggestions?

> outflow0405_s <- outflow0405[1:113920,]
Error: BigR[bigr.frame.[]]: The given filtering condition must be a logical bigr.vector.
Vinayak Machan • over 11 years ago

Rob,

I ran into this error as well. I was unable to do any manipulation in R to eliminate the offending record. I had to eliminate it the record outside of the R environment and then subsequently read it into R.

Since my file size was bigger (~ 1GB) I was unable remove the bad record on my PC. The way I did this was
1. First upload file (including bad records) to the BigInsights HDFS file system.
2. Convert it into a BigSheets worksheet.
3. Open the worksheet in BigSheets and search for the record using the unique key. The data set I am working with has a unique key as the first column.
4. Delete the record.
5. Save the worksheet in BigSheets
6. Export it to a plain HDFS file. This final file was what I then used to read into the bigr.frame in R.

Hope this helps.

Regards
- Vinayak

Comments are closed.

Big R

17 comments