Student name: Jack Simpson
Project title: An Evaluation of the Capability of Current Spam Tweet Detection Models to Detect Spam Tweets in Real-Time Environments
Course: BSc (Hons) Computing
This project aimed to assess the capability of spam tweet detection models to detect spam tweets in real time. The literature review showed that researchers have been creating increasingly accurate machine learning models which can predict whether a tweet is spam or not. Recently, researchers have begun to claim that their models can detect spam in ‘real time’ (the speed at which tweets are received through the Twitter API). This is important as tweets are used as a data source both academically and commercially. Being able to filter spam from the API would increase the integrity of Twitter as a data source. I recreated and tested various existing models, comparing their results against data gathered on the requirements of Twitter live streams in both academic and commercial settings.
In this project, models proposed by researchers were recreated in a Python virtual environment using scikit-learn, and data was collected on how fast these models process tweets. This was compared with information obtained on the speed of the Twitter API in real-world scenarios, including a use-case scenario from a local data analytics company.
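The comparison described above can be illustrated with a minimal sketch. It is not the project's actual code: `toy_predict` is a hypothetical stand-in for a trained model's prediction function, and `ASSUMED_STREAM_RATE` is an illustrative figure, not a measured Twitter API rate. The sketch times how many tweets per second a classifier processes and checks that figure against the rate at which a stream delivers tweets.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    predict: Callable[[Sequence[str]], Sequence[int]],
    tweets: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed tweets-per-second rate over `repeats` runs."""
    if not tweets:
        raise ValueError("tweets must not be empty")
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        predict(tweets)
        # Guard against a zero reading when the call is faster than the timer.
        elapsed = max(time.perf_counter() - start, 1e-9)
        best = max(best, len(tweets) / elapsed)
    return best


def keeps_up(throughput: float, stream_rate: float) -> bool:
    """True if the model processes tweets at least as fast as they arrive."""
    return throughput >= stream_rate


def toy_predict(tweets: Sequence[str]) -> list:
    # Hypothetical stand-in for a trained model: flags tweets containing "free".
    return [1 if "free" in t.lower() else 0 for t in tweets]


# Illustrative sample data and stream rate (assumptions, not measured values).
sample = ["Win a FREE phone now", "Lunch was great today"] * 500
ASSUMED_STREAM_RATE = 50.0  # tweets per second

rate = measure_throughput(toy_predict, sample)
print(f"Model throughput: {rate:.0f} tweets/s")
print(f"Keeps up with assumed stream: {keeps_up(rate, ASSUMED_STREAM_RATE)}")
```

To time a real model, the `predict` argument could be replaced by a function that runs feature extraction followed by a fitted classifier's prediction step, so that the measured rate includes the full per-tweet cost.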
The results showed that all models tested were capable of real-time detection in some capacity, depending on the level of filtering applied to the API. Only one model was capable of accurately detecting spam tweets from the fastest Twitter API. This work shows that, theoretically, these models could be used in a real-time environment. The logical next step would be a project which implements a model in a real-world environment. This would show the effect of the method of implementation on the ability of the model to process tweets in real time.
The outcome was that there are various types of “live stream” environments. All models tested could keep up with the slowest of these environments; however, only one model was capable of accurately processing tweets in the fastest. The next step would be to implement one of these models and see to what extent the infrastructure required for implementation affects the speed of the model.
My time at University
During my time at Edge Hill, I have achieved external qualifications such as Microsoft’s Databases certification. I also took advantage of the classroom assistant scheme, through which I had a paid job as a classroom assistant in the second-year database seminars during my final year.