A software engineering for Data Science project
The purpose of this project was to follow a typical software life cycle model from beginning to end
while improving an existing machine learning model that classifies Amazon reviews using sentiment analysis.
The project followed the OpenUP agile methodology. We also discuss wrapping the final optimized model in a
user-friendly interface that returns quick and easy positive/negative predictions for new review text.
The goal of the project was to improve the accuracy and reliability of the existing machine learning model by following a typical software life cycle process. The original model made predictions with an accuracy of ~60%, which is no better than always guessing the most common class once the spread of the data is taken into account: it exclusively predicts a rating of 1.0, and that rating accounts for 60% of the entire dataset.
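The point about the 60% baseline can be illustrated with a short sketch. The label list below is toy data invented for illustration (it only mirrors the 60% share described above), and `majority_baseline_accuracy` is a hypothetical helper name, not part of the project code.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy obtained by always predicting it."""
    if not labels:
        raise ValueError("labels must not be empty")
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Toy labels (invented): 6 of 10 reviews carry the rating 1.0.
toy_labels = [1.0] * 6 + [2.0, 3.0, 4.0, 5.0]

label, accuracy = majority_baseline_accuracy(toy_labels)
print(label, accuracy)  # 1.0 0.6
```

A model that only ever outputs the majority rating scores exactly this baseline, so an accuracy near 60% on such data carries no evidence that the model has learned anything from the review text.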
For our model, target labels have been transformed into binary (positive and negative) sentiment. We have improved the overall accuracy, and achieved good performance on precision, recall, F1 and AUC (area under the ROC curve).
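As a rough illustration of the metrics named above, the following sketch computes precision, recall, F1 and AUC for binary labels in plain Python. It assumes positive sentiment is encoded as 1 and negative as 0; the function names and the toy inputs are hypothetical, and the project itself obtained its metrics from Azure ML Studio rather than from code like this.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for labels encoded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (invented values, not project results).
print(binary_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

Unlike accuracy, these metrics expose a model that only predicts one class: its recall on the other class is zero and its AUC stays near 0.5.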
For the term project, our solution architecture is limited to MS Azure ML Studio. We delivered our improved model and outputs via ML Studio directly.
In addition, we have drawn out what the architecture might look like for a production deployment, in which users interact with a simple UI attached to an API endpoint. In the UI, users can input review text and get a “pos/neg” sentiment returned. We would keep our reference data in Azure storage and use ML Studio to build the model. The model and API would be deployed using AKS (Azure Kubernetes Service), since we want real-time interaction and response.
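To make the UI-to-API interaction more concrete, here is a minimal sketch of the two steps a UI would perform around the endpoint call: building a request from the review text and reading the sentiment out of the response. The JSON field names (`review_text`, `sentiment`), the `MAX_CHARS` limit and both function names are assumptions made for illustration; they are not the schema of an actual Azure endpoint, and the network call itself is left out.

```python
import json

MAX_CHARS = 5000  # hypothetical upper bound on review length


def build_request(review_text):
    """Validate the user's input and serialize it as a JSON request body."""
    if not isinstance(review_text, str):
        raise TypeError("review text must be a string")
    cleaned = review_text.strip()
    if not cleaned:
        raise ValueError("review text must not be empty")
    if len(cleaned) > MAX_CHARS:
        raise ValueError("review text is too long")
    # Hypothetical request schema: a single "review_text" field.
    return json.dumps({"review_text": cleaned})


def parse_response(body):
    """Extract the "pos"/"neg" label from a JSON response body."""
    # Hypothetical response schema: a single "sentiment" field.
    sentiment = json.loads(body).get("sentiment")
    if sentiment not in ("pos", "neg"):
        raise ValueError("unexpected sentiment value: %r" % (sentiment,))
    return sentiment


print(build_request("  Great product, works as described!  "))
print(parse_response('{"sentiment": "pos"}'))  # pos
```

Validating the input before it reaches the model is one way to address user input errors at the UI layer instead of inside the model.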
For more information about architecture and designs: https://hrahhrah.github.io/Architecture_Design.pdf
We researched and experimented with various two-class algorithms and data processing steps such as feature hashing and text preprocessing. After extensive testing and evaluation, we chose a model that uses the Decision Jungle algorithm, an SQL transformation to split the reviews into binary classes, and the text preprocessing, Extract Key Phrases from Text, Latent Dirichlet Allocation and SMOTE modules from Azure.
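For readers unfamiliar with SMOTE, the sketch below shows the core idea in plain Python: each synthetic minority sample is placed at a random point on the line between an existing minority sample and one of its nearest minority neighbors. This is a simplified illustration under stated assumptions (numeric feature tuples, Euclidean distance, a hypothetical `smote_sketch` helper), not the implementation used by the Azure SMOTE module.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Toy minority-class feature vectors (invented for illustration).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
for point in smote_sketch(minority_points, n_new=3):
    print(point)
```

Oversampling of this kind is typically applied to the training split only, so that the evaluation data keeps the original class balance.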
Our final model metrics:
Risk Number | Risk | Likelihood (0-1) | Impact (1-10) | Risk Score | Mitigation |
---|---|---|---|---|---|
1 | Model process failure | 0.1 | 10 | 1 | MS Azure ML has high availability, so the likelihood of a system failure is low and any downtime would be minimal. |
2 | Model scoring failure | 0.2 | 8 | 1.6 | Depending on the algorithm, some models are more likely to degrade over time than others. In addition, some algorithms do not provide details about the score, and incorrect predictions may be difficult to explain. This risk will be taken into account during the model selection and evaluation steps. |
3 | Data input (user error) | 0.6 | 4 | 2.4 | Non-text or non-review input from users will be scored unless error handling is built into the model. This risk is not captured in the requirements, so any documentation or user manuals should explain how to use the product. |
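
The risk scores in the table are consistent with multiplying likelihood by impact. The sketch below reproduces that arithmetic from the table's values and ranks the risks by score; the `risk_score` helper and the tuple layout are illustrative choices, not part of the project.

```python
# (risk, likelihood 0-1, impact 1-10), copied from the risk table.
risks = [
    ("Model process failure", 0.1, 10),
    ("Model scoring failure", 0.2, 8),
    ("Data input (user error)", 0.6, 4),
]


def risk_score(likelihood, impact):
    """Risk score as likelihood times impact, rounded to avoid float noise."""
    return round(likelihood * impact, 2)


# Highest score first, so the most pressing risk is listed at the top.
ranked = sorted(risks, key=lambda r: risk_score(r[1], r[2]), reverse=True)
for name, likelihood, impact in ranked:
    print(f"{name}: {risk_score(likelihood, impact)}")
```

Ranked this way, user input error is the highest-scoring risk, even though its impact rating is the lowest of the three.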
- GitHub: https://github.com/MattAgone/DSCI644-Team-B-Project
- Trello: https://trello.com/b/UvRzi2Zz/dsci644-team-b-term-project
- Architecture and designs: https://hrahhrah.github.io/Architecture_Design.pdf
- Original project proposal: https://hrahhrah.github.io/Project_Proposal.pdf
- Presentation slides: https://hrahhrah.github.io/Presentation.pdf
- Final report: https://hrahhrah.github.io/Final_Report.pdf