Bot that detects spam in comments #2 (more training data, SVM classifier, checking user previous comments, whitelist / blacklist / scamlist)
I updated my bot, whose purpose is to detect spam comments on the Steem blockchain. It uses the Multinomial Naive Bayes algorithm combined with SVM (model stacking). It can reply to a spam comment and downvote it. I built it for the #polish community, but it can be adapted to any tag (or all tags); it is only a matter of the training file.
Log from console:
I have stacked 4 algorithms: Multinomial Naive Bayes and 3 variants of SVM.
self.model = StackedModel([
    MultinomialNB(),
    SVC(kernel='linear', C=C, probability=True),
    SVC(kernel='rbf', gamma=0.7, C=C, probability=True),
    NuSVC(probability=True),
])
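The `StackedModel` class itself is not shown in this post. One simple way to combine several probabilistic classifiers is to average their predicted class probabilities (soft voting). The sketch below assumes that approach and a scikit-learn style `fit` / `predict_proba` interface; the name `AveragingEnsemble` is hypothetical, and the bot's actual stacking logic may differ.

```python
class AveragingEnsemble:
    """Average the class probabilities of several models (soft voting).

    Hypothetical helper, not the bot's StackedModel: each model only needs
    scikit-learn style fit(X, y) and predict_proba(X) methods.
    """

    def __init__(self, models):
        self.models = list(models)

    def fit(self, X, y):
        # Train every base model on the same data.
        for model in self.models:
            model.fit(X, y)
        return self

    def predict_proba(self, X):
        # One probability table per model, averaged cell by cell.
        tables = [model.predict_proba(X) for model in self.models]
        n_models = len(tables)
        return [
            [sum(values) / n_models for values in zip(*rows)]
            for rows in zip(*tables)
        ]
```

With scikit-learn installed, the four classifiers listed above could be passed to `AveragingEnsemble([...])` unchanged, since each of them exposes `predict_proba` when `probability=True` is set (and `MultinomialNB` always does).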
To check the accuracy, I calculated a confusion matrix for each algorithm.
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Confusion matrix:
[[65 1]
[ 0 45]]
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Confusion matrix:
[[65 1]
[ 0 45]]
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.7, kernel='rbf',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Confusion matrix:
[[66 0]
[ 0 45]]
NuSVC(cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, nu=0.5, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False)
Confusion matrix:
[[65 1]
[ 1 44]]
The confusion matrix for the stacked model looks as follows.
Stacked model
Confusion matrix:
[[65 1]
[ 0 45]]
As you can see, the results for the individual algorithms and for the stacked model are similar. You have to experiment a bit here to find the best combination. The results will probably change slightly as the data set grows.
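To compare the matrices more directly, accuracy, precision and recall can be derived from each of them. The sketch below is a minimal helper, assuming that rows hold the true labels, columns hold the predicted labels, and that the label order is (ham, spam); that ordering is an assumption, as the post does not state it.

```python
def per_class_metrics(matrix, labels):
    """Return accuracy plus precision and recall per label.

    Assumes matrix[i][j] counts samples whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        report[label] = {
            "precision": matrix[i][i] / predicted if predicted else 0.0,
            "recall": matrix[i][i] / actual if actual else 0.0,
        }
    return report


# Stacked-model matrix from the post; the (ham, spam) label order is an assumption.
stacked = [[65, 1], [0, 45]]
report = per_class_metrics(stacked, ["ham", "spam"])
```

Under these assumptions the stacked model gets 110 of 111 test comments right, and the single error is a ham comment predicted as spam.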
The bot checks not only the current comment but also the author's previous comments. I think a single comment such as "nice photo" is OK, but if a user posts this type of comment all the time, it is considered spam:
The bot also pays attention to repeated, generic comments:
And even scams (if user is on scamlist):
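The checks described above could be combined roughly as follows. This is only a sketch: the function name, the order in which the lists are consulted, and the `min_spam_ratio` parameter are all assumptions made for illustration, not the bot's actual rules.

```python
def assess_comment(author, current_prob, previous_probs, threshold,
                   whitelist=(), blacklist=(), scamlist=(), min_spam_ratio=0.5):
    """Return 'ham', 'spam' or 'scam' for a comment.

    Hypothetical decision rule: list membership takes precedence, then the
    current comment must look like spam AND enough of the author's previous
    comments must look like spam too.
    """
    if author in whitelist:
        return "ham"
    if author in scamlist:
        return "scam"
    if author in blacklist:
        return "spam"
    if current_prob < threshold:
        return "ham"
    if not previous_probs:
        # No history to consult: a single generic comment is tolerated.
        return "ham"
    spam_like = sum(1 for p in previous_probs if p >= threshold)
    return "spam" if spam_like / len(previous_probs) >= min_spam_ratio else "ham"
```

Here `current_prob` and `previous_probs` stand for the spam probabilities the classifier assigns to the current comment and to the author's earlier comments.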
Running
$ POSTING_KEY=<posting_key> spam_detector.py config.json
The private posting key is passed in through the POSTING_KEY environment variable.
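Reading these two inputs at startup might look like the sketch below. The function name `read_startup_args` is hypothetical; only the `POSTING_KEY` variable and the single `config.json` argument come from the command shown above.

```python
def read_startup_args(argv, environ):
    """Return (posting_key, config_path) or exit with a usage hint."""
    posting_key = environ.get("POSTING_KEY")
    if not posting_key:
        raise SystemExit("POSTING_KEY environment variable is not set")
    if len(argv) != 2:
        raise SystemExit(
            "usage: POSTING_KEY=<posting_key> spam_detector.py config.json"
        )
    # argv[0] is the script name, argv[1] the path to the config file.
    return posting_key, argv[1]
```

In a script this would be called as `read_startup_args(sys.argv, os.environ)`. Keeping the key out of the config file means the file can be shared or committed without exposing it.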
Configuration
All parameters are stored in the config.json file.
Key | Value |
---|---|
account | account used by bot |
nodes | list of Steem nodes |
tags | tags which are observed |
probability_threshold | threshold to classify as spam |
training_file | input training file |
blacklist_file | file containing blacklist |
whitelist_file | file containing whitelist |
scamlist_file | file containing users who post scams |
reply_mode | 0 - without reply, 1 - with reply |
vote_mode | 0 - without vote, 1 - with vote |
vote_weight | weight of the vote from range [-100.0, 100.0] |
num_previous_comments | number of user comments that are investigated |
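A loader for this file could check the keys from the table before the bot starts. The sketch below is an illustration only: the function names are hypothetical, and the assumption that `probability_threshold` lies in [0, 1] is mine, not stated in the table.

```python
import json

REQUIRED_KEYS = {
    "account", "nodes", "tags", "probability_threshold", "training_file",
    "blacklist_file", "whitelist_file", "scamlist_file", "reply_mode",
    "vote_mode", "vote_weight", "num_previous_comments",
}


def validate_config(config):
    """Check a parsed config dict against the table above and return it."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError("missing config keys: " + ", ".join(sorted(missing)))
    if config["reply_mode"] not in (0, 1) or config["vote_mode"] not in (0, 1):
        raise ValueError("reply_mode and vote_mode must be 0 or 1")
    if not -100.0 <= config["vote_weight"] <= 100.0:
        raise ValueError("vote_weight must be within [-100.0, 100.0]")
    # Assumption: the threshold is a probability, so it should lie in [0, 1].
    if not 0.0 <= config["probability_threshold"] <= 1.0:
        raise ValueError("probability_threshold must be within [0, 1]")
    return config


def load_config(path):
    """Read config.json from disk and validate it."""
    with open(path, encoding="utf-8") as handle:
        return validate_config(json.load(handle))
```

Failing early on a missing key or an out-of-range `vote_weight` is cheaper than discovering the problem after the bot has already voted.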
The training file contains rows labeled ham or spam, like the ones below:
ham Wow. Even though I was well aware of Churchill's later career, I actually didn't know he was here during the Anglo Boer war, let alone as a prisoner of war. Thank you for a very interesting and informative post!
ham Yea this post isn't really about fixing all the problems on Steem - it's just that there always seems to be a lot of drama over the trending page, and i think it's a bad thing for new people coming to the site to see first, so just throwing out the idea of getting rid of it for now.
ham Yea, I believe there was something about notifications in one of the SteemIt, Inc roadmaps but don't quote me on that. Notifications are really important though, can't expect everyone to use
ham Yeah, I may have to sit down & do a post or two myself! It’s fun to imagine! Other than promoted posts, I do think we should have advertising, albeit in a very user focused & friendly way.
ham Yes I agree. My suggestion was based on how things actually are currently which as you said is not representative of the best posts. I don’t believe that is going to change any time soon, if ever, so in the mean time I think it would be better to just get rid of that page.
ham Yes! This thought never occurred to me before, but your idea is perfect!! I think it would help underpaid content creators be noticed. Better yet, don't sort people based on potential payout. Create an algorithm that sorts out such things as grammatical and spelling errors, "articles" that are too short, authors that post 10 times per day, copy/paste content, ect. and only the highest quality bloggers would make it to the top...
ham Yes, there are only a few flagging because majority is scared. He has already ruined many people's accounts and reps and flagged all of their posts to $0.00 for voicing opinons. People disagree with the rewards of his posts. You are well aware of haejin's 10-12 posts per day reaching an easy $350 per post every time. I don't think anyone is against his predictions in the sense that anyone is able to use common sense and choose if they invest or not based on his predictions. I have not seen any whales helping recover these people's accounts for flagging him. Perhaps this is not an unjustified flag war? I have sacrificed my entire blog and all earnings for six weeks to try and lower the rewards. I am not scared of the consequences as I know what they are. People are scared though so I think if a lot of users delegate a small portion of their Steem power to one of these accounts then the rewards can be lower substantially. I also feel that it would be a more organized approach at flagging him as it will be a scheduled downvote of 10 posts every evening. I feel that if enough people make the delegation's he will be unable to flag every user that delegated down to $0.00 as he would have to use all of his power flagging instead of upvoting himself. You can count on support from whales to resolve unjustified flag wars, if you feel like post are more over-valued than the majority of Steem content then flag them and don't be scared of reprisals.
ham you are right. As it is now, he's spending a tonne of his vote power flagging anyone who disagrees with his rewards. He cannot flag everyone it would cut into his profits, as his vote power drains to 0. If rancho comes in and starts flagging too, then they are making even less money because now he's wasting his vote power by flagging instead of upvoting the 10 posts a day that he has to.
ham You know.. I delegated what little SP i can afford exactly because you took the risk. Now if he did wanna go all out flag, he'd had to waste his vp on both you and me. if enough people did it we can even go against the biggest abusers too.
ham Your concept is very solid, it might seem hard to implement in the start but I know that if you keep at it you will reach your goal!I cannot wait to start using your system!
spam i follow you
spam Upvote, follow, resteem
spam UPVOTED
spam UPVOTED & RESTEEMED
spam Upvoted and followed you back
spam UPVOTED RESTEEMED
spam very funny
spam very nice
spam Write Link, send 0.100 sbd. 3000+ followers can see you (resteem)
spam Yes very nice post.
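Rows in this shape can be parsed with a few lines of Python. The sketch below assumes the label is the first whitespace-delimited token on each row; the exact separator in the real training file is not stated in the post, and the function name is hypothetical.

```python
def parse_training_lines(lines):
    """Yield (label, text) pairs from 'ham <text>' / 'spam <text>' rows.

    Assumption: the label is the first whitespace-delimited token on a row.
    """
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank rows
        parts = line.split(None, 1)
        if len(parts) != 2 or parts[0] not in ("ham", "spam"):
            raise ValueError(
                f"row {number}: expected 'ham <text>' or 'spam <text>'"
            )
        yield parts[0], parts[1]
```

For example, `list(parse_training_lines(open(path, encoding="utf-8")))` would give a list of labeled comments ready to be vectorized.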
Technology Stack
- python3.6
- libraries: steem-python, scikit-learn, pandas, textblob, bs4
The repository contains a requirements.txt file.
Roadmap:
- enlarging the training set (done)
- adding new algorithm such as Support Vector Machine (done)
- taking into account previous comments, not only current one (done)
- adding to blacklist / whitelist (done)
- taking into account user reputation
- tuning parameters in existing algorithms
- adding new algorithm such as Neural Network and maybe Random Forest
- enlarging the training set (again)
Posted on Utopian.io - Rewarding Open Source Contributors
Thank you for the contribution. It has been approved.
You can contact us on Discord.
[utopian-moderator]
I am very glad to see this!
This could improve the quality of the comments.
This is a good, useful, helpful and beneficial bot for the community, like @cheetah.
A further improvement would be for the bot to get a lot of Steem Power and flag these spam comments itself.
I think comments like "Nice post", "Nice photo" and "Please follow me" should be automatically flagged as spam (especially if their authors upvote their own comments). These comments are meaningless to the author of the original post. Their writers only show that they don't really care about the original post or its author; they just want attention for their own profiles and posts, so they are doing it for their own benefit.
Great work! :)
What I really want to see is a REST API where I just submit the comment URL or comment body and get back the probability that it is spam. That would be epic.
Upvote + resteem
That's all I can do for now.
Regards.
PS. I know, I'll write more about this!
Amazing, I think @emrebeyler was thinking about developing this. Can't wait to see it in action!
Impressive! What is the efficiency ratio (without false-positive)?
I divided the entire dataset in the ratio 80:20 into a training and test set. After training the model, I carried out the test, resulting in the following confusion matrix.
As you can see, the results are pretty good here. Only one type II error (False Negative).
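An 80:20 split like the one described can be sketched with the standard library alone. The function name and the fixed seed are assumptions for illustration; the bot may use a different splitting routine.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle repeatable.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


train_rows, test_rows = split_dataset(range(100))
```

Fixing the seed keeps the test set stable between runs, so changes in the confusion matrix reflect changes in the model or data rather than a different shuffle.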
But the real challenge here is precisely defining what is spam and what is not. At the beginning I was probably overzealous; now I try to balance it more. Therefore, for example, I do not treat single comments like "nice photo" or "please follow me" as true spam, but only when they are repeated over and over again.
That is why I constantly analyze the results and, if necessary, make corrections to the data set / parameters, so that it all works well in practice and not only in theory. The other thing is that a lot of comments the bot classifies as spam have to be ignored (bid-bots, photo contests, welcoming users), because marking them as spam would not end well :)