Final Project - CSE 5334 - Data Mining
- Manthankumar Patel
- Apr 7, 2020
Updated: May 11, 2020
Name: Manthankumar Patel
Student ID: 1001778249
Download my Jupyter notebook (comment-predictor.ipynb) from GitHub
To run this .ipynb file, the 'boardgamegeek-reviews' directory has to be in the same location as the .ipynb file.
All cells in this .ipynb file must be executed sequentially.
The Goal of Project:

The purpose of this project is to build a model that is trained on the given data and predicts a rating from 0 to 10 for any comment we give it. First, we will import the necessary libraries and then load the data from the CSV file 'bgg-13m-reviews.csv'. Next, we will pre-process the data by formatting the comments and ratings, split it into DataFrames, and vectorize each DataFrame. I have used the SVM classifier because it is a supervised machine learning algorithm that can be used for classification, and in general SVM performs very well when you have a huge number of features, for example in text classification with a bag-of-words model.
Importing Libraries and Loading Data:
First, we will import useful libraries and build the CSV path to load the data, which upon download is located in a directory called boardgamegeek-reviews, in a CSV file named 'bgg-13m-reviews.csv'.

Now, we will load the data, reading the CSV file with the pandas library. Then we will keep only the 'comment' and 'rating' columns and replace NaN values with ''.
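The loading step can be sketched as below. The directory layout follows the setup described above; the `load_reviews` helper and the tiny in-memory sample are illustrative assumptions, not the notebook's exact code.

```python
import io
import os

import pandas as pd

# Path to the reviews CSV, following the download layout described above.
csv_path = os.path.join("boardgamegeek-reviews", "bgg-13m-reviews.csv")

def load_reviews(source):
    # Read only the two columns we need and replace missing comments
    # with empty strings, as described above.
    df = pd.read_csv(source, usecols=["comment", "rating"])
    df["comment"] = df["comment"].fillna("")
    return df

# Tiny in-memory stand-in for the real CSV (hypothetical sample rows),
# so the function can be demonstrated without the 13M-row file.
sample = io.StringIO("comment,rating\nGreat game!,8.0\n,7.5\n")
reviews = load_reviews(sample)
```

In the notebook itself, `load_reviews(csv_path)` would read the real file.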

Formatting Data:
We have loaded the data, so now we will do some preprocessing by formatting the comments and ratings. For the comments, we trim the white space and lowercase all the characters. For the ratings, we remove the rows that have a 0 rating and ensure that every rating value is of type float.
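A minimal sketch of this formatting step, assuming the cleaning is done with pandas string methods (the `format_reviews` helper name is mine, not the notebook's):

```python
import pandas as pd

def format_reviews(df):
    out = df.copy()
    # Normalize the comments: trim surrounding white space and lowercase.
    out["comment"] = out["comment"].str.strip().str.lower()
    # Ensure every rating is a float, then drop the 0-rating rows.
    out["rating"] = out["rating"].astype(float)
    out = out[out["rating"] != 0].reset_index(drop=True)
    return out

raw = pd.DataFrame({"comment": ["  Fun Game  ", "meh"],
                    "rating": ["8", "0"]})
clean = format_reviews(raw)
```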


Splitting Data:
Now we have to split the dataframe into trainData, devData, and testData and reset the index for every dataframe. From the print output below it can be seen that the train dataframe holds close to 2,100,000 rows, while the development and test dataframes each hold nearly 9.5% of all the data.
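One way to produce such a split is sketched below. The use of scikit-learn's `train_test_split`, the exact 9.5% fractions, and the `split_frames` helper are assumptions matching the description above, not necessarily the notebook's exact code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_frames(df, dev_frac=0.095, test_frac=0.095, seed=0):
    # Carve off the test split first, then the dev split from what is
    # left, so train keeps roughly 81% of the rows.
    rest, test_df = train_test_split(df, test_size=test_frac, random_state=seed)
    train_df, dev_df = train_test_split(
        rest, test_size=dev_frac / (1.0 - test_frac), random_state=seed)
    # Reset the index on every dataframe, as described above.
    return (train_df.reset_index(drop=True),
            dev_df.reset_index(drop=True),
            test_df.reset_index(drop=True))

toy = pd.DataFrame({"comment": [f"comment {i}" for i in range(1000)],
                    "rating": [float(i % 9 + 1) for i in range(1000)]})
trainData, devData, testData = split_frames(toy)
```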

Vectorization:
As you can see above, I tried to remove stop words from the comments, but the data set is so large that it took an enormously long time, so I am using vectorization instead. The values in the comment column are strings, so we need to convert every string into a numerical series using a vectorizer. Now we will copy and store every vectorized comment in train_c, test_c, and dev_c respectively, and the ratings in train_r, test_r, and dev_r.
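The vectorization step can be sketched as below, assuming a plain tf-idf vectorizer (the exact vectorizer settings in the notebook may differ). It is fitted on the training comments only, and the same fitted vocabulary then transforms the dev comments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative comment lists; the real ones come from the split
# dataframes. Variable names follow the train_c/dev_c convention above.
train_comments = ["fun strategy game", "too long and boring", "great components"]
dev_comments = ["a boring game"]

vectorizer = TfidfVectorizer()
train_c = vectorizer.fit_transform(train_comments)  # sparse tf-idf matrix
dev_c = vectorizer.transform(dev_comments)          # same vocabulary
```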

SVM Classifier:
Finally, we have the data in the appropriate format and we are able to begin training the classifier.
Here I have used the SVM classifier because I researched on the internet which classifiers can be used for large data, and I came to the conclusion that SVM is very good for text classification; I have added the reference link below in References for this research. For example, an SVM with direct tf-idf vectors does the best, for both quality and performance, in text classification with a bag-of-words model.
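A minimal sketch of this classifier step, assuming scikit-learn's `LinearSVC` over the tf-idf features and the ratings used directly as class labels (the tiny corpus and integer labels are illustrative, not the notebook's real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

comments = ["love this game", "terrible rules", "love the art", "terrible box"]
ratings = [9, 2, 9, 2]

# Vectorize the comments, then fit a linear SVM on the tf-idf features.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

clf = LinearSVC(C=0.1)
clf.fit(X, ratings)

# Predict a rating class for a new comment.
pred = clf.predict(vectorizer.transform(["love the rules"]))
```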
Hyperparameter Tuning:
Now, it is time to improve the accuracy through hyperparameter tuning. We have to find the regularization parameter C that increases performance: we test different values of C, examine the relation between C and accuracy, and select the most effective value of C.
From the result above, we can say that the accuracy is 34.10144% for the value C = 0.1.
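The C search described above can be sketched as a simple loop: fit one model per candidate C and keep the value with the best dev accuracy. The candidate grid and the tiny synthetic data here are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_comments = ["great fun", "awful game", "great art", "awful rules"]
train_ratings = [8, 3, 8, 3]
dev_comments = ["great game", "awful art"]
dev_ratings = [8, 3]

vec = TfidfVectorizer()
train_c = vec.fit_transform(train_comments)
dev_c = vec.transform(dev_comments)

# Try each candidate C and keep the one with the best dev accuracy.
best_c, best_acc = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    clf = LinearSVC(C=c)
    clf.fit(train_c, train_ratings)
    acc = clf.score(dev_c, dev_ratings)
    if acc > best_acc:
        best_c, best_acc = c, acc
```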

Checking Final Accuracy:
Now, we will check our final accuracy on the test dataframe using the regularization parameter C = 0.1.
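This final evaluation amounts to retraining with the chosen C = 0.1 and scoring the held-out test split; the sketch below uses tiny synthetic data as a stand-in for the real split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_comments = ["good game", "bad game", "good art", "bad art"]
train_ratings = [8, 3, 8, 3]
test_comments = ["good rules", "bad rules"]
test_ratings = [8, 3]

# Fit with the tuned C, then score on the untouched test comments.
vec = TfidfVectorizer()
clf = LinearSVC(C=0.1)
clf.fit(vec.fit_transform(train_comments), train_ratings)
test_acc = clf.score(vec.transform(test_comments), test_ratings)
```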

Finally, that is a good improvement in accuracy over what SVM could do on its own. We built the model using SVM, performed hyperparameter tuning to find the optimal value, tested on the test data, and got 35.10318% accuracy, as you can see above.
Conclusion and Challenges:
In conclusion, we can say that hyperparameter tuning can increase the accuracy of the SVM classifier. In addition, I learned how vectorization is used to speed up Python code without using a loop. This project posed many challenges because the data set was very large and tough to predict, which is why vectorization and SVM took an enormous time to execute. Plenty of the comments were very long and not all were encoded in the same format. As for the ratings, the class representation wasn't exactly even. Developing a model accurate enough to cope with such issues could likely only be done with a neural network. Many questions came to my mind, such as: Which classifier should I use? In which way can I increase performance? How do I increase accuracy using hyperparameter tuning? Am I able to apply an optimization idea to extend accuracy? For all of those questions, I got answers.
Thank you for reading my blog post.
References:
Shuffle the data:
Vectorization:
SVM: