*TrufflePig*: A Bot based on Natural Language Processing and Machine Learning to support Content Curators and Minnows

in #utopian-io, 7 years ago (edited)

Project Overview

Steemit can be a tough place for minnows, as new users are often called. I had to learn this myself. Due to the sheer amount of new posts that are published by the minute, it is incredibly hard to stand out from the crowd. Often even nice, well-researched, and well-crafted posts of minnows get buried in the noise because they do not benefit from a lot of influential followers that could upvote their quality posts. Hence, their contributions are getting lost long before one or the other whale could notice them and turn them into trending topics.

However, this user-based curation also has its merits, of course. With some luck, your nice posts get traction and the recognition they deserve. Maybe there is a way to support the Steemit content curators such that high-quality content does not go unnoticed anymore. In fact, I developed a curation bot called TrufflePig to do exactly this with the help of Natural Language Processing and Machine Learning.

The Concept

The basic idea is to use well-paid posts of the past as training examples to teach a Machine Learning Regressor (MLR) what high-quality Steemit content looks like. In turn, the trained MLR can be used to identify posts of high quality that were missed by the curation community and received much less payment than they deserved. We call these posts truffles.

The general idea of this bot is the following:

  1. We train a Machine Learning regressor (MLR) using Steemit posts as inputs and the corresponding Steem Dollar (SBD) rewards and votes as outputs.

  2. Accordingly, the MLR should learn to predict potential payouts for new, previously unseen Steemit posts.

  3. Next, we can compare the predicted payout with the actual payouts of recent Steemit posts. If the Machine Learning model predicts a huge reward, but the post was barely paid at all, we classify this contribution as an overlooked truffle and list it in a daily top list to draw attention to it.
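The comparison in step 3 can be sketched roughly as follows. This is a minimal illustration and not the bot's actual code: the `Post` fields, the `find_truffles` helper, and the `max_actual_sbd` and `top_n` values are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    predicted_sbd: float  # payout predicted by the trained regressor
    actual_sbd: float     # payout the post has received so far


def find_truffles(posts, max_actual_sbd=10.0, top_n=3):
    """Rank posts by how far the predicted payout exceeds the real one.

    max_actual_sbd and top_n are illustrative values, not the bot's settings.
    """
    candidates = [
        p for p in posts
        if p.actual_sbd <= max_actual_sbd and p.predicted_sbd > p.actual_sbd
    ]
    # The largest gap between predicted and actual payout comes first.
    candidates.sort(key=lambda p: p.predicted_sbd - p.actual_sbd, reverse=True)
    return candidates[:top_n]


posts = [
    Post("Well paid post", predicted_sbd=50.0, actual_sbd=45.0),
    Post("Overlooked gem", predicted_sbd=70.0, actual_sbd=0.5),
    Post("Average post", predicted_sbd=2.0, actual_sbd=1.5),
]
for post in find_truffles(posts):
    print(post.title, round(post.predicted_sbd - post.actual_sbd, 2))
```

Sorting by the gap rather than by the raw prediction keeps posts that are already well rewarded out of the list, which matches the goal of surfacing overlooked content.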

The deployed bot can be found here: https://steemit.com/@trufflepig. Furthermore, a recent top list is this publication, for example, or this super fresh exemplar from today.

Technology Stack

The project is written in 100% pure Python 3 and uses the following third-party libraries:

  • The official Steem Python library: The library downloads past and recent Steemit posts from the Blockchain. In addition, Steem Python is used to comment under truffle posts and publish a daily top-list.

  • gensim: Part of the feature encoding involves projecting Steemit posts into a vector space using the library's Latent Semantic Indexing functionality.

  • langdetect: This library is used to filter for English-only Steemit posts.

  • pyEnchant: Counts spelling mistakes in posts.

  • pyphen: pyphen is used to compute the number of syllables of all words in Steemit posts.

  • pandas: DataFrames are the standard data container format used throughout the project.

  • language-check: This library helps to identify and quantify grammar mistakes in Steemit posts.

  • Scikit-Learn: Of course, the most widely used Python machine learning library is also applied in this project.

Feature Encoding and Machine Learning

Usually the most difficult and involved part of engineering a Machine Learning application is the proper design of features. How are we going to represent the Steemit posts so they can be understood by the Machine Learning regressor?

It is important that we use features that represent the content and quality of the post. We do not want to use author specific features such as the number of followers or past author payouts. Although I am quite certain (and I'll test this sometime soon) that these are incredibly predictive features of future payouts, these do not help us to identify overlooked and buried truffles.

I used some features that encode the style of the posts, such as number of paragraphs, average words per sentence, or spelling mistakes. Clearly, posts with many spelling errors are usually not high-quality content and are, to my mind, a pain to read. Moreover, I included readability scores like the Flesch-Kincaid index to quantify how easy and nice a post is to read.
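To make the style features more concrete, here is a small sketch that computes a few of them for a text. It is only an approximation: the project uses pyphen for syllable counts, whereas this example uses a crude vowel-group heuristic, and it assumes the grade-level variant of the Flesch-Kincaid formula. The function and key names are made up for illustration.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def style_features(text):
    """Compute a handful of simple style features for one post."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)

    # max(1, ...) avoids a division by zero for empty input.
    words_per_sentence = len(words) / max(1, len(sentences))
    syllables_per_word = syllables / max(1, len(words))
    # Flesch-Kincaid grade level formula.
    fk_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {
        "num_paragraphs": len(paragraphs),
        "words_per_sentence": words_per_sentence,
        "syllables_per_word": syllables_per_word,
        "flesch_kincaid_grade": fk_grade,
    }


text = "The cat sat on the mat. It was happy.\n\nThen it slept."
print(style_features(text))
```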

Still, the question remains: how are we going to encode the content of the post? How do we represent the topic someone chose and the story an author told? The simplest encoding, and one that is used quite often, is the so-called 'term frequency inverse document frequency' (tf-idf). This technique basically encodes each document, so in our case each Steemit post, by the particular words that are present and weighs them by their (heuristically) normalized frequency of occurrence. However, this encoding produces vectors of enormous length with one entry for each unique word in all documents. Most entries in these vectors are zeroes because each document contains only a small subset of all potential words. For instance, if there are 150,000 different unique words in all our Steemit posts, each post will be represented by a vector of length 150,000 with almost all entries set to zero. Even if we filter and ignore very common words such as "the" or "a", we could easily end up with vectors having 30,000 or more dimensions.
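A tiny tf-idf example may help to see where the long, mostly empty vectors come from. This is a plain-Python sketch using one common weighting (relative term frequency times the logarithm of the inverse document frequency); the exact weighting used in the project may differ, and the toy documents are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return the vocabulary and one tf-idf vector per tokenized document."""
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in documents for token in set(doc))

    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vector = [
            (counts[token] / len(doc)) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ]
        vectors.append(vector)
    return vocabulary, vectors


docs = [
    "steem rewards good posts".split(),
    "good posts get buried".split(),
    "truffle pig finds buried posts".split(),
]
vocabulary, vectors = tf_idf(docs)
print(len(vocabulary), "unique words")
print([round(value, 3) for value in vectors[0]])
```

Every vector has one entry per word in the whole vocabulary, so the vector length grows with the collection while each single post only fills a few entries. A word that appears in every document, like "posts" here, gets weight zero.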

Such high-dimensional input is usually not very useful for Machine Learning. We would rather have a dimensionality much lower than the number of training documents to effectively cover our data space. Accordingly, we need to reduce the dimensionality of our Steemit post representation. A widely used method is Latent Semantic Analysis (LSA), often also called Latent Semantic Indexing (LSI). LSI compression of the feature space is achieved by applying a Singular Value Decomposition (SVD) on top of the previously described word frequency encoding.
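The SVD step can be illustrated with a few lines of NumPy. Note that the project relies on gensim's LSI functionality for this; the sketch below uses a plain dense SVD on random toy data instead, and the `lsa_project` helper is a hypothetical name introduced for the example.

```python
import numpy as np


def lsa_project(matrix, n_components):
    """Project the rows of a document-term matrix onto the top singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the n_components strongest directions for every document.
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
doc_term = rng.random((6, 20))  # 6 toy "documents" described by 20 "words"
reduced = lsa_project(doc_term, n_components=2)
print(reduced.shape)  # (6, 2)
```

Each document is now described by 2 numbers instead of 20. With real data the same idea shrinks tens of thousands of word dimensions to a much smaller number of components.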

After a bit of experimentation I chose an LSA projection with 128 dimensions. In combination with the aforementioned style features, each post is, therefore, encoded as a vector with 150 entries.

For training, the bot reads posts between 7 and 17 days of age. These are first filtered and subsequently encoded. Posts that are too short or far too long, non-English posts, and posts with too many spelling errors are removed from the training set. Usually, this leaves a training set of about 70,000 contributions. The resulting matrix of size 70,000 by 150 is used as the input to a multi-output Random Forest regressor from scikit-learn. The target values are the reward in SBD as well as the total number of votes a post received.
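A multi-output Random Forest fit in scikit-learn looks roughly like this. The data below is synthetic and much smaller than the real training matrix, and the hyperparameters are placeholders rather than the bot's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n_posts, n_features = 200, 150  # toy size; the real matrix is about 70,000 x 150
X = rng.random((n_posts, n_features))

# Two synthetic targets per post: reward in SBD and number of votes.
y = np.column_stack([
    10.0 * X[:, 0] + rng.normal(0, 0.1, n_posts),
    100.0 * X[:, 1] + rng.normal(0, 1.0, n_posts),
])

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)  # a two-column y makes this a multi-output fit

predictions = model.predict(X[:3])
print(predictions.shape)  # (3, 2): one [reward, votes] pair per post
```

Passing a target array with two columns is enough for `RandomForestRegressor` to predict both values at once, so no separate model per target is required.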

After the training, which is scheduled once a week, the Machine Learning regressor is used on a daily basis on recent posts between 2 and 26 hours old to predict the expected reward and votes. Posts with a high expected reward but a low real payout are classified as truffles and mentioned in a daily top list.
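The two age windows, 7 to 17 days for training and 2 to 26 hours for daily predictions, can be expressed as a small helper. This is only a sketch; the function name, the constants, and the example dates are made up for illustration.

```python
from datetime import datetime, timedelta

TRAIN_WINDOW = (timedelta(days=7), timedelta(days=17))
PREDICT_WINDOW = (timedelta(hours=2), timedelta(hours=26))


def role_of_post(created, now):
    """Return 'training', 'prediction', or None depending on the post's age."""
    age = now - created
    if TRAIN_WINDOW[0] <= age <= TRAIN_WINDOW[1]:
        return "training"
    if PREDICT_WINDOW[0] <= age <= PREDICT_WINDOW[1]:
        return "prediction"
    return None


now = datetime(2018, 2, 20, 12, 0)
print(role_of_post(datetime(2018, 2, 10, 12, 0), now))  # 10 days old
print(role_of_post(datetime(2018, 2, 20, 2, 0), now))   # 10 hours old
print(role_of_post(datetime(2018, 2, 18, 12, 0), now))  # 2 days old
```

Posts that fall between the two windows, like the 2-day-old one, are used for neither training nor prediction.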

A more detailed explanation together with a performance evaluation of the setup can also be found in this post.

Future Roadmap

  • I want to conduct further experiments with different ML regressors as well as feature encodings. I already made some experiments using Doc2Vec instead of LSI. But this was not very fruitful. A more thorough investigation may improve the bot's judgment further.

  • A very new feature of the bot is that posts voted by @cheetah are automatically excluded from any analysis. Besides, I would also like to make the bot respect and follow various spammer blacklists as used by @steemcleaners.

  • It would be cool if, in addition to the daily top list, you could call the bot manually by commenting @trufflepig under your post. The bot would answer, give you an estimate of how much it thinks your post is worth, and upvote you in case of a truffle. Currently, TrufflePig is a daily batch job and would need to be turned into a service for this.

  • Parts of the bot, especially the LSI encoding, could be reused for information retrieval and a Steemit recommendation system. For example, read recommendations could be given like: If you enjoyed this post you might also be interested in the following contributions ... This would be based on cosine similarity among different posts' LSI encodings.

  • I am planning a German version, *Trüffelschwein*, that specifically digs for truffles among German Steemit posts to support the DACH community. I'm open to other languages as well (kr!?), but would, of course, need help from another developer who is a native speaker of the language in question :-).
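The recommendation idea from the roadmap, ranking posts by the cosine similarity of their LSI encodings, could look roughly like this. The encodings are invented three-dimensional toy vectors and `recommend` is a hypothetical helper, not part of the bot.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recommend(query_id, encodings, top_n=2):
    """Return the ids of the posts most similar to the query post."""
    query = encodings[query_id]
    scores = {
        post_id: cosine_similarity(query, vector)
        for post_id, vector in encodings.items()
        if post_id != query_id
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


encodings = {
    "post-a": [0.9, 0.1, 0.0],
    "post-b": [0.8, 0.2, 0.1],
    "post-c": [0.0, 0.1, 0.9],
    "post-d": [0.1, 0.9, 0.2],
}
print(recommend("post-a", encodings))  # ['post-b', 'post-d']
```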

Open Source and Contributions

Finally, the project is freely available and open sourced on my GitHub profile. *TrufflePig* can be used by anyone in a non-commercial setting; please check the LICENSE file.
Of course, contributions in the form of pull requests, GitHub issues, or feature requests are always welcome.

Cheers and have fun with:

*TrufflePig*

(By the way, the bot's avatar has been created using https://robohash.org/)



Posted on Utopian.io - Rewarding Open Source Contributors

Hey @smcaterpillar, this is the Sorin human who has spotted you as an underrated contributor to a better world! I'll be in Berlin on March 15th and 16th, would you perhaps care to meet in person?

A few comments on your post: given widespread payout manipulation, a more complex algorithm will certainly improve the results in the future. You write "The basic idea is to use well paid posts of the past as training examples to teach a Machine Learning Regressor (MLR) how high quality Steemit content looks like" - that is good enough in a first version but in the long run it is not the best value-adding approach because it offers "reflexion" / "echo" and encourages "more of the same".

The referential for what "quality" means should be external to the mechanism that Steemit uses to value posts, otherwise it comes very close to what Baron Munchhausen was doing by "pulling himself up by the straps of his boots".

underrated contributor to a better world !

I'm not so sure if my bot makes the world any better. I'm at least glad that Steemit operates on proof of (delegated) stake and is not as wasteful as BTC. So at least my bot doesn't make the world any worse.

[..]given widespread payout manipulation[...] The referential for what "quality" means should be external to the mechanism that Steemit uses to value posts, otherwise it comes very close to what Baron Munchhausen was doing by "pulling himself up by the straps of his boots".

You do have a point regarding the massive manipulation due to voting bots and services. By the way, irony intended by using such a service for your comment?! :-D

However, I beg to differ here, at least to some degree. If there wasn't any correlation between payouts and quality, then Steemit's premise as a curation platform based on proof of brain, and its sheer existence, would be fruitless. So the bot's idea is to pull back attention from the voting bots and steer the platform as a whole more towards rewarding quality content, whatever this is.

This brings me directly to my second point. What is quality content? This is really hard to evaluate. Is it something chosen by a jury or high intellectuals? Or stuff picked directly by the readers themselves? TrufflePig relies on the latter. Yet, if there was some external measure of quality, what would it be?

Of course, the taste of the masses may not cater to the taste of an individual. For instance, I can't stand most of today's popular or chart music :-D. Finding content that is right for you, in particular, is definitely not the aim of @trufflepig. However, I do see the need for more personalized recommendations. To quote the wise words from someone who knows much, much more about this platform than me (@lextenebris):

One of the big problems with Steemit as I see it is the fact that trying to find content that you're interested in is like sipping from a fire hose. One directed straight into your face.

I'm experimenting currently how I can reuse parts of the bot to create more personalized recommendations. So stay tuned.

Here's my LinkedIn profile

Sure, why not, added you as a connection.

"By the way, irony intended by using such a service for your comment?! :-D"

Absolutely ! I'm experimenting in order to learn because the whole mechanics is not only complex it is also obscure (probably on purpose). I intend experimentation to go on at several levels - for instance I've "pumped" my last post over the $100 bar, see if this psychological threshold plays any role here ... not sure but we'll see. I do believe my post is good though :-)

Then back to the trufflepig discussion - I absolutely agree that the whole idea of rewarding content in Steemit is valid: there definitely IS correlation between payout and quality! But my argument went to the "second degree" and looked at trufflebot: since the correlation is not 1 and Steemit is ALREADY using this assessment dimension, re-using it in trufflebot is "procyclical" and reinforces whatever bias this dimension has.

On the contrary introducing another assessment dimension helps to give more balance and offers an alternative. Precisely because it's difficult to say what is quality and all we know is that the equation "high payout = quality" certainly does NOT hold (not 100%, not for all posts anyway) then maybe we can do better by defining quality along more than one axis / dimensions

And the idea is that through the trufflepig YOU, the owner of the bot, are free to define your own assessment dimension. Some people will certainly disagree with your choice of what you consider to be quality but so what ? They are free to create their own trufflepig and train it with their parameters if they wish.

Congratulations! Your post has been selected as a daily Steemit truffle! It is listed on rank 1 of all contributions awarded today. You can find the TOP DAILY TRUFFLE PICKS HERE.

I upvoted your contribution because to my mind your post is at least 70 SBD worth and should receive 170 votes. It's now up to the lovely Steemit community to make this come true.

I am TrufflePig, an Artificial Intelligence Bot that helps minnows and content curators using Machine Learning. If you are curious how I select content, you can find an explanation here!

Have a nice day and sincerely yours,
TrufflePig

Hurray! 1337!

This is really not staged, I swear! He came up with this selection by himself!

This is a very interesting concept. Thank you for stopping by my blog @trufflepig! There are so many posts written by minnows that go unnoticed. We try to find some of these undervalued authors with the #newbieresteemday initiative, but it's a manual search and curation. I will definitely be following along!

That does not work (yet) :-)

:(
It's a really interesting feature.

Yes, I manually downloaded the trained bot from my VPS and let it check this post, should be worth 70 SBD and 170 votes :-D Yeah!

Maybe it's getting listed tomorrow as number one truffle :-D
There are some nitty-gritty details, though, that I did not mention, so if this post makes more than 10 SBD it won't be listed as a truffle anymore.

I think this is one of my favorite curator services already. Not because it featured one of my posts, but because it is a really interesting concept.

You have collected your daily Power Up! This post received an upvote worth of 0.29$.
Learn how to Power Up Smart here!

Maybe I will include this feature into the batch job instead of making a service. This means if you call @trufflepig manually, in the worst case you have to wait 24 hours. Yet, this makes my life much easier because I do not have to deal with concurrency issues of having the bot upvoting and commenting under the top list truffles and also commenting and upvoting on demand :-D.

That would be awesome, even if it would take 24 hours for the analysis.
As soon as I have enough SP I will definitely delegate to this bot.

Nice, thank you! Btw, including it in the batch job has the advantage that the bot won't comment twice under your post in case you did make into the truffle top list. Moreover, making it into the top list will also yield a higher vote from the bot than calling it manually.

Huh? Seems like I already voted on this post, thanks for calling anyway!

Let's see if this works. It hasn't been merged to master yet, but the server is operating on a beta branch now. We'll know for sure tomorrow. So, here it goes:

Glad to know @trufflepig

Short update on the roadmap:

I want to conduct further experiments with different ML regressors as well as feature encodings. I already made some experiments using Doc2Vec instead of LSI. But this was not very fruitful. A more thorough investigation may improve the bot's judgment further.

I did this and improved the bot slightly. From now on the LSA is not only computed over tokens, but over bigrams of tokens as well.

I also tried trigrams and 4grams as well as skip-grams, but they did not improve the bot's performance.
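As a small illustration of what adding bigrams means here: next to the single tokens, every pair of neighbouring tokens becomes an additional "word" for the encoding. The helper below is a hypothetical sketch, not the bot's implementation.

```python
def tokens_with_bigrams(tokens):
    """Return the tokens followed by all bigrams of neighbouring tokens."""
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(tokens_with_bigrams(["machine", "learning", "bot"]))
# ['machine', 'learning', 'bot', 'machine learning', 'learning bot']
```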

I'm currently working on the "call a pig" feature for @trufflepig; afterwards I'll focus on the recommendation system.

I like the idea of calling the bot to any post to make a prediction :)

Thank you for the contribution. It has been approved.

You can contact us on Discord.
[utopian-moderator]

Great, thanks!

Hey @smcaterpillar I am @utopian-io. I have just upvoted you!

Achievements

  • You have less than 500 followers. Just gave you a gift to help you succeed!
  • This is your first accepted contribution here in Utopian. Welcome!

Community-Driven Witness!

I am the first and only Steem Community-Driven Witness. Participate on Discord. Lets GROW TOGETHER!

Up-vote this comment to grow my power and help Open Source contributions like this one. Want to chat? Join me on Discord https://discord.gg/Pc8HG9x

Thanks a lot :-)

It's a great concept! Nice to see some background and a full feature road-map.

Thanks to @josephsavage, this post was resteemed and highlighted in today's edition of The Daily Sneak.

Thank you for your efforts to create quality content!