How to Use AI to Find Articles with Ruby
Note: This is pure magic and highly experimental. In a nutshell, we're going to look at the trending page and try to predict which new posts will reach trending. To do this, we're going to use ID3. According to Wikipedia:
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.
In Ruby, we can use the ID3 algorithm through the ai4r gem.
Ok, it's not really magic. So, how does it work? I have ID3 look at some specific attributes of the top 100 trending posts. Specifically:
author_reputation percent_steem_dollars promoted category net_votes
Based on these attributes, I have it predict the total_pending_payout_value of a new post. If total_pending_payout_value can be predicted, we will display the difference between the prediction and the current pending payout.
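Some of these fields need light preprocessing before ID3 sees them. The sketch below mirrors the helpers in the script further down: the raw reputation is squeezed onto a log scale, and payout strings are split into a whole-number amount and a currency symbol. The sample inputs are made up for illustration, and the "amount symbol" string shape is an assumption based on how the script parses it.

```ruby
# Raw reputation -> display score, as done by to_rep in the script.
def to_rep(raw)
  raw = raw.to_i
  level = Math.log10(raw.abs)   # -Infinity when raw is 0
  level = [level - 9, 0].max    # anything below 10^9 collapses to 0
  (level * 9 + 25).to_i
end

# Payout strings are assumed to look like "16.789 SBD".
def base_value(raw)
  raw.split(' ').first.to_i     # to_i drops the fractional part
end

def symbol_value(raw)
  raw.split(' ').last
end

puts to_rep('5000000000000')    # made-up raw reputation => 58
puts to_rep(0)                  # floor of the scale => 25
puts base_value('16.789 SBD')   # => 16
puts symbol_value('16.789 SBD') # => SBD

# The number the script reports is the prediction minus the current payout.
prediction = 17                 # hypothetical ID3 output
puts prediction - base_value('2.450 SBD') # => 15
```

Because amounts are truncated to whole units, every payout between 16.000 and 16.999 lands in the same bucket, which suits an algorithm that works on discrete values.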
As always, we use Radiator with bundler. You can get bundler with this command:
$ gem install bundler
I've tested it on various versions of Ruby. The oldest one I got it to work on was:
ruby 2.0.0p645 (2015-04-13 revision 50299) [x86_64-darwin14.4.0]
First, make a project folder:
$ mkdir radiator
$ cd radiator
Create a file named Gemfile containing:
source 'https://rubygems.org'
gem 'radiator', github: 'inertia186/radiator'
gem 'ai4r' # Adds general machine learning capabilities.
Then run the command:
$ bundle install
Create a file named ai-scan.rb containing:
require 'rubygems'
require 'bundler/setup'
Bundler.require
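# Convert a raw reputation value to the familiar display score (e.g. 59).
def to_rep(raw)
  raw = raw.to_i
  level = Math.log10(raw.abs)
  level = [level - 9, 0].max
  level = (level * 9) + 25
  level.to_i
end

# Whole-number amount of a payout string; to_i drops the fraction.
def base_value(raw)
  raw.split(' ').first.to_i
end

# Currency symbol of a payout string (the part after the space).
def symbol_value(raw)
  raw.split(' ').last
end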
api = Radiator::Api.new
# Inputs for ID3; the last label is the value to predict.
data_labels = %w(
  author_reputation percent_steem_dollars promoted category net_votes
  total_pending_payout_value
)
prediction_label = data_labels.last
options = {
  limit: 100
}
options[:tag] = ARGV.first if ARGV.any?
response = api.get_discussions_by_trending(options)
trending_comments = response.result
data_items = trending_comments.map do |comment|
  data_labels.map do |label|
    case label
    when 'author_reputation'; to_rep comment[label]
    when 'promoted'; base_value comment[label]
    when 'total_pending_payout_value'; base_value comment[label]
    else; comment[label]
    end
  end
end
data_set = Ai4r::Data::DataSet.new data_labels: data_labels, data_items: data_items
id3 = Ai4r::Classifiers::ID3.new.build(data_set)
response = api.get_discussions_by_created(options)
new_comments = response.result - trending_comments
predictions = new_comments.map do |comment|
  next unless comment.mode == 'first_payout'
  data_item = data_labels.map do |label|
    case label
    when 'author_reputation'; to_rep comment[label]
    when 'promoted'; base_value comment[label]
    when 'total_pending_payout_value'; base_value comment[label]
    else; comment[label]
    end
  end
  prediction = (id3.eval(data_item) rescue nil)
  next if prediction.nil?
  {
    difference: prediction - base_value(comment.total_pending_payout_value),
    symbol: symbol_value(comment.total_pending_payout_value),
    url: "https://steemit.com#{comment.url}"
  }
end.reject(&:nil?)
if predictions.any?
  puts "Predicting the following payouts will rise by:"
  predictions.sort_by { |p| p[:difference] }.each do |prediction|
    puts "#{prediction[:difference]} #{prediction[:symbol]}: #{prediction[:url]}"
  end
else
  puts "Nothing to predict."
end
Then run it:
$ ruby ai-scan.rb
The expected output will be something like this:
Predicting the following payouts will rise by:
0 SBD: https://steemit.com/history/@steemizen/today-in-history-uss-arkansas
0 SBD: https://steemit.com/steem/@ozchartart/usdsteem-btc-daily-poloniex-bittrex-technical-analysis-market-report-update-162-jan-14-2017
10 SBD: https://steemit.com/travel/@writingamigo/traveler-s-observations-the-origins-of-habits-how-environement-forces-us-to-believe-that-it-is-our-fault
13 SBD: https://steemit.com/fiction/@johnjgeddes/tempest-and-tea-rediscovering-the-magic-within-part-1-of-2
15 SBD: https://steemit.com/travel/@exploretraveler/photo-of-the-day-skagway-alaska
17 SBD: https://steemit.com/news/@contentjunkie/spacex-launches-first-rocket-since-explosion
17 SBD: https://steemit.com/food/@anti-sophist/bold-lamb-loin-chops-and-basil-potatoes-2017114t195031380z
17 SBD: https://steemit.com/pizzagate/@gizmosia/the-video-the-world-must-watch-chilling-info-re-child-trafficking-posted-today
17 SBD: https://steemit.com/minecraft/@thedonutguy7/how-to-download-a-minecraft-map-for-windows
17 SBD: https://steemit.com/fly/@altcointrader77/flycoin-in-the-hands-of-a-trusted-few
17 SBD: https://steemit.com/fiction/@internutter/challenge-01476-d015-historical-hysterical-first
17 SBD: https://steemit.com/animal/@favorit/nature-that-surrounds-us-in-the-animal-world-black-stallion-23
18 SBD: https://steemit.com/film/@movie-online/confidential-secret-market-1974-romance-history
18 SBD: https://steemit.com/life/@lukestokes/day-6-update-the-wim-hof-method
18 SBD: https://steemit.com/kr/@leesunmoo/6r1hns
19 SBD: https://steemit.com/challenge30/@franks/challenge30-deep-space-mining-unobtainium
You can also pass a tag:
$ ruby ai-scan.rb photography
The expected output will be something like this:
Predicting the following payouts will rise by:
0 SBD: https://steemit.com/travel/@koskl/visiting-cusco-peru
0 SBD: https://steemit.com/nature/@zaskia/beautiful-flower
0 SBD: https://steemit.com/photography/@distantsignal/shooting-milkshake-web-series-on-vintage-russian-lenses
0 SBD: https://steemit.com/photography/@chrissysworld/the-sky-burns-the-angels-flee-der-himmel-brennt-die-engel-fliehn-english-deutsch
0 SBD: https://steemit.com/photography/@klava/white-truffle
0 SBD: https://steemit.com/photography/@rynow/sunken-fish-trailer
0 SBD: https://steemit.com/food/@lonilush/traditional-balkan-cheese-pie-burek-original-recipe-with-pictures
0 SBD: https://steemit.com/nature/@riostarr/mushrooms-on-dead-wood
1 SBD: https://steemit.com/photography/@richar/life-and-death-on-wall-street
1 SBD: https://steemit.com/photography/@xntryk1/swapmeet-finds-640
5 SBD: https://steemit.com/photography/@jasonrussell/jacks-fork-river-10-pictures
5 SBD: https://steemit.com/photography/@kalemandra/reflections
17 SBD: https://steemit.com/photography/@briansss/check-it-out-my-photo-album-of-my-trip-through-venezuela
17 SBD: https://steemit.com/food/@alizee/pecal-tubers-vegetables-papaya-flower
Either way, you can use these results as voting suggestions, because the ID3 algorithm predicts that the payouts on these articles will rise by the amounts shown.
Under the hood, here's a rough explanation of what's going on. We take the trending posts, and just extract certain fields as inputs to ID3. The inputs become:
author_reputation | percent_steem_dollars | promoted | category | net_votes | total_pending_payout_value |
---|---|---|---|---|---|
52 | 10000 | 0 | romance | 146 | 16 |
58 | 10000 | 0 | story | 160 | 16 |
67 | 0 | 0 | science | 162 | 16 |
58 | 10000 | 0 | travel | 178 | 16 |
60 | 10000 | 0 | gaming | 166 | 16 |
54 | 10000 | 0 | fiction | 141 | 15 |
54 | 10000 | 0 | food | 163 | 15 |
53 | 10000 | 0 | art | 167 | 15 |
67 | 0 | 0 | japan | 108 | 15 |
61 | 10000 | 0 | poker | 21 | 15 |
59 | 10000 | 0 | til | 158 | 15 |
63 | 10000 | 0 | music | 165 | 15 |
60 | 10000 | 0 | art | 160 | 15 |
59 | 10000 | 0 | aceh | 155 | 15 |
59 | 10000 | 0 | writing | 147 | 15 |
55 | 10000 | 0 | life | 160 | 15 |
51 | 10000 | 0 | painting | 148 | 15 |
57 | 0 | 1 | life | 130 | 15 |
59 | 10000 | 0 | travel | 163 | 15 |
ID3 builds a decision tree from the above inputs, then runs each new post through that tree, looking for matching patterns. That is how it tries to predict the final total_pending_payout_value for the new posts.
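How does ID3 decide which attribute to look at first? It picks the one whose values best separate the outcomes, measured by information gain. What follows is a minimal pure-Ruby sketch of that measure, not ai4r's actual implementation; the rows are invented and trimmed to two attributes for brevity.

```ruby
# Shannon entropy (in bits) of a list of outcome labels.
def entropy(labels)
  total = labels.size.to_f
  labels.group_by { |label| label }.values.reduce(0.0) do |sum, group|
    prob = group.size / total
    sum - prob * Math.log2(prob)
  end
end

# Information gain from splitting rows on the column at index attr.
# The last column of every row is the outcome being predicted.
def information_gain(rows, attr)
  base = entropy(rows.map(&:last))
  remainder = rows.group_by { |row| row[attr] }.values.reduce(0.0) do |sum, subset|
    sum + (subset.size.to_f / rows.size) * entropy(subset.map(&:last))
  end
  base - remainder
end

# Invented rows: [category, percent_steem_dollars, payout]
rows = [
  ['science', 0,     16],
  ['science', 0,     16],
  ['art',     10000, 15],
  ['art',     10000, 15],
  ['travel',  10000, 16],
  ['travel',  10000, 15]
]

puts "category gain:              #{information_gain(rows, 0).round(3)}"
puts "percent_steem_dollars gain: #{information_gain(rows, 1).round(3)}"
```

On these rows, category separates the payouts more cleanly than percent_steem_dollars, so a tree built from them would branch on category first.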
For instance, it might notice that authors with a reputation of 59, posting in til, tend to have a total_pending_payout_value of 15. So if a new post matches, it'll make that prediction.
But then, it might also notice a correlation between certain percent_steem_dollars and promoted values, but only when the category is science. It's that flexible.
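Once built, the tree is just a chain of exact-match checks. The sketch below uses a hand-written, hypothetical tree (not one produced by ai4r) to show how a prediction falls out, and why a post carrying a value the tree has never seen gets no prediction at all, which is presumably the case the script guards against with rescue nil.

```ruby
# Hypothetical hand-built tree: branch on category, then on author_reputation.
# Leaves are plain payout values; inner nodes are Hashes.
TREE = {
  attribute: 'category',
  branches: {
    'til'     => { attribute: 'author_reputation', branches: { 59 => 15 } },
    'science' => 16
  }
}

def predict(node, post)
  return node unless node.is_a?(Hash)   # reached a leaf
  child = node[:branches][post[node[:attribute]]]
  return nil if child.nil?              # value not present in the tree
  predict(child, post)
end

seen   = { 'category' => 'til', 'author_reputation' => 59 }
unseen = { 'category' => 'til', 'author_reputation' => 40 }

puts predict(TREE, seen).inspect    # 15
puts predict(TREE, unseen).inspect  # nil
```

Note that science posts are resolved at the first branch, while til posts need a second check; different paths through the tree can consult different attributes.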
As an analogy, it's a little bit like weather prediction: "In this area, on this day, for the last 100 years, when the temperature is x and the humidity is y, it rains z percent of the time."
You will notice that I specifically exclude the author name from the prediction inputs. If you want to include it, you can modify data_labels in the script and add author to the beginning.
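That change would look roughly like the sketch below. No other edits should be needed, since the script's case statement passes any label it doesn't specifically handle straight through from the post.

```ruby
# data_labels with author prepended; the label to predict stays last.
data_labels = %w(
  author
  author_reputation percent_steem_dollars promoted category net_votes
  total_pending_payout_value
)
prediction_label = data_labels.last

puts data_labels.inspect
puts "predicting: #{prediction_label}"
```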
While including author might help ID3 make better predictions, personally, I'm not interested in correlating the author name. We already have enough of those kinds of tools (albeit without ID3). I want ID3 to be indifferent about the author and try to make its prediction on more subtle inputs, which is what it's designed to do.
Interesting.
Cool post. Did you measure the correlation between the predicted payout and the real one?
I'm still looking at it. When I originally posted this post, my script said I would earn $17. Then, 5 minutes later, it couldn't make any more predictions about this post.
The other samples in this post seem to correlate a little better than chance, on cursory analysis. I'll do a more in-depth post later.
Very helpful post! Interesting too.