The submission deadline for the Big Data for Social Good Challenge is March 3, 2015. As you continue working on your project, the following tips may come in handy if you're using Big R.
Test your application with a small data set first
Some of Big R's machine learning and statistical functions can take significant processing time, which slows down prototype development. While you are incrementally developing and improving your project, it may be worthwhile to work with a smaller sample of your data set and verify that the results match your expectations (see ?bigr.sample; typing "?" before any function name opens its documentation). Once you are confident in your code, bring in the entire data set. BigSheets can also help you scan and filter the data. See the documentation of bigr.frame or bigr.matrix (?bigr.frame, ?bigr.matrix) for more information on how to access your HDFS data in Big R.
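The idea of prototyping on a sample can be sketched in plain R. This is only a base R stand-in, not Big R code: it uses the built-in mtcars data set and a hypothetical 25% sampling fraction. With HDFS-backed data, bigr.sample plays this role, and its actual arguments should be checked with ?bigr.sample.

```r
# Base R stand-in for prototyping on a sample (not Big R code).
set.seed(42)                       # make the sample reproducible

full_data <- mtcars                # built-in data set standing in for your data
sample_fraction <- 0.25            # hypothetical fraction; tune to your data size

# Draw row indices without replacement and keep only those rows.
n_sample <- ceiling(nrow(full_data) * sample_fraction)
sample_rows <- sample(seq_len(nrow(full_data)), size = n_sample)
small_data <- full_data[sample_rows, ]

# Develop against small_data, then switch back to full_data once the code works.
cat("Prototyping on", nrow(small_data), "of", nrow(full_data), "rows\n")
```

Fixing the seed keeps the sample stable between runs, which makes it easier to tell whether a change in results comes from your code or from the data.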
Data formatting is crucial
Well-formatted data is essential to get the most out of your analysis. For example, if your data source is a flat delimited file (such as a CSV file), every line must have the same number of delimiters. Pay special attention to the beginning and end of your data set: if it contains a header row, make sure you set header=TRUE when building the bigr.frame or bigr.matrix. The colnames and coltypes parameters are especially important if you are using your data set for machine learning.
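One way to check the delimiter rule before uploading a file is base R's count.fields, which reports how many fields each line contains. The sketch below runs on a local copy of the file, not on HDFS, and the file contents are made up for illustration.

```r
# Write a small delimited file with one malformed line (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,city",
             "1,34,Austin",
             "2,41",            # this line is missing a field
             "3,29,Boston"),
           csv_path)

# count.fields() returns the number of fields found on each line.
# comment.char = "" stops "#" from being treated as a comment marker.
fields_per_line <- count.fields(csv_path, sep = ",", quote = "\"",
                                comment.char = "")

# Lines whose field count differs from the header's are suspect.
expected <- fields_per_line[1]
bad_lines <- which(fields_per_line != expected)
print(bad_lines)   # line 3 is flagged
```

An empty result means every line has the same number of fields as the header row.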
We provide a variety of machine learning algorithms: support vector machines (?bigr.svm), naïve Bayes classifiers (?bigr.naive.bayes), logistic regression (?bigr.logistic.regression), linear regression (?bigr.lm), generalized linear models (?bigr.glm), and k-means clustering (?bigr.kmeans). The general workflow for using these algorithms is a three-step process:
- Split your data into training and test sets (see ?bigr.sample to split a data set). The right split sizes depend on your particular problem type.
- Train your model.
- Test your model's prediction accuracy with the test set (use the Big R package's generic predict function to predict on a test data set, see ?predict).
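The three steps above can be sketched with base R equivalents. This is not Big R code: lm and sample stand in for the Big R functions, the 70/30 split is an illustrative choice, and mtcars is a built-in data set used as a placeholder.

```r
set.seed(1)                        # reproducible split
cars_data <- mtcars                # placeholder data set

# Step 1: split into training (70%) and test (30%) sets; the ratio is illustrative.
train_rows <- sample(seq_len(nrow(cars_data)),
                     size = floor(0.7 * nrow(cars_data)))
train_set <- cars_data[train_rows, ]
test_set  <- cars_data[-train_rows, ]

# Step 2: train a model on the training set only.
model <- lm(mpg ~ wt + hp, data = train_set)

# Step 3: predict on the held-out test set and measure the error.
predictions <- predict(model, newdata = test_set)
rmse <- sqrt(mean((test_set$mpg - predictions)^2))
cat("Test RMSE:", round(rmse, 2), "\n")
```

Keeping the test rows out of training is what makes the final error a fair estimate of how the model behaves on data it has not seen.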
Provide different HDFS paths for models learned using different parameters
All the machine learning model scripts store the trained model on HDFS so that it can be retrieved later. The next time you open an R session, you do not have to train the model again; you can load it from the HDFS location you specified at training time. It helps to store models trained with different parameter values at separate HDFS locations so that you can compare their performance on the same data set. The same function used to train a model can also load a model from a given HDFS location. For example, for SVM:
- To learn a model use: bigr.svm(VARIABLE_TO_MODEL~., TRAINING_DATA, directory=PATH_TO_STORE_MODEL_ON_HDFS)
- To load a previously stored model use: bigr.svm(directory=PATH_TO_LOAD_MODEL_FROM)
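The one-path-per-parameter-setting habit can be illustrated with a local analogue. In this sketch a temporary local directory stands in for HDFS, saveRDS and readRDS stand in for Big R's own model storage, and the parameter (a polynomial degree) is hypothetical. With Big R, each generated path would instead be passed as the directory argument shown above.

```r
# Hypothetical base location; with Big R this would be an HDFS directory.
base_dir <- file.path(tempdir(), "models")

# Hypothetical parameter grid: polynomial degree of a simple regression.
degrees <- c(1, 2, 3)

# Build one distinct path per parameter value, e.g. .../models/degree_2
model_dirs <- file.path(base_dir, paste0("degree_", degrees))

# Train one model per setting and store each in its own directory.
for (i in seq_along(degrees)) {
  dir.create(model_dirs[i], recursive = TRUE, showWarnings = FALSE)
  fit <- lm(mpg ~ poly(wt, degrees[i]), data = mtcars)
  saveRDS(fit, file.path(model_dirs[i], "model.rds"))
}

# Later, reload each model from its own path and compare the fits.
for (i in seq_along(degrees)) {
  fit <- readRDS(file.path(model_dirs[i], "model.rds"))
  cat("degree", degrees[i], "R^2:", round(summary(fit)$r.squared, 3), "\n")
}
```

Encoding the parameter value in the path name means no stored model is overwritten by a later run and each one can be traced back to its settings.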
Enable Big R debugging if needed and rerun
Big R debug logging can be turned on by calling the unexported function bigr:::bigr.debug(TRUE). If you run into issues, the logs may help pinpoint where in the workflow the problem occurs. Remember that every Big R command involves a lot of background communication and processing, and debug mode can provide some insight into those processes.
Get set with Analytics for Hadoop
Don't forget that there are a bunch of tutorials available for Analytics for Hadoop. Read through them to help troubleshoot, get ideas, or do a quick review.
- IBM Analytics for Hadoop
- IBM Analytics for Hadoop Documentation
- Analytics for Hadoop Tutorials
- IBM Hadoop Dev
- More Hadoop Tutorials from IBM [Restriction: You cannot install or use extra R packages. As a consequence, if you follow the tutorial "Analyzing big data with Big R", you cannot complete lessons 4 and 5, because they require downloading extra R packages.]
If you have more questions about using Big R for your project or getting started with Analytics for Hadoop, please post on the challenge discussion board.