RE: How Much of the Rewards Pool is Paid out by BitBots Votes V's Organic Votes

in #utopian-io • 7 years ago

I have been trying a type of market basket analysis in R, but getting it wrong so far. I'm trying something like: if A votes for Y all the time, who else also votes for Y all the time? I'm very new to R, so this will take me months to master lol

R Programming for Data Science by Roger D. Peng might be handy :-)

He and Jeff Leek have a dozen or so courses on R, stats and data science on Coursera. I don't know how Coursera works right now (haven't used it in years), but a few years back I did enjoy two courses by them:

  1. Exploratory Data Analysis
  2. R Programming

I'm doing a data science thing with Microsoft and edX... time is my biggest constraint. But thank you for the links, because extra references are so handy to have, especially that book. Nice :-)

I suspect that you are going to really enjoy working on analysis in R once you get your feet wet. Thinking about these problems from a procedural point of view really throws certain aspects into sharp relief. I find that it really tests my assumptions: what I expect to see versus what I am actually seeing, and why I'm seeing those things.

Though you have to be careful with the "Alice votes for Bob all the time, who else votes for Bob all the time?" form of inquiry, because it is perfectly reasonable for human beings to act like that. Especially on Steemit, where providers of anything outside of talk about cryptocurrency in general and steem in particular are rare, it is very easy for real communities of people to end up largely voting for each other if they are interested in the same niche subject.

But that's okay, because you would notice that very quickly once you started pulling those clusters out. This is how we learn.
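For anyone who wants to try the "who else votes for Y all the time" question before tackling a full market basket analysis, here is a minimal sketch in Python (not R, purely for illustration). It assumes you already have the votes as plain (voter, author) pairs. The function name `loyal_voters` and the `min_share` cutoff are made up for this example, and a high share on its own does not separate a bot from a tight niche community.

```python
from collections import Counter, defaultdict


def loyal_voters(votes, author, min_share=0.5):
    """Return {voter: share} for voters who give at least `min_share`
    of all their votes to `author`.

    `votes` is an iterable of (voter, voted_author) pairs. How those
    pairs are pulled from the chain is outside this sketch.
    """
    # Count, per voter, how many votes went to each author.
    per_voter = defaultdict(Counter)
    for voter, voted_author in votes:
        per_voter[voter][voted_author] += 1

    result = {}
    for voter, counts in per_voter.items():
        share = counts[author] / sum(counts.values())
        if share >= min_share:
            result[voter] = share
    return result


sample_votes = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "carol"),
    ("dave", "bob"),
    ("erin", "carol"), ("erin", "bob"), ("erin", "carol"),
]
print(loyal_voters(sample_votes, "bob"))
```

In the sample data, alice and dave clear the 50% cutoff for bob and erin does not. Whether any cutoff is meaningful on real data is exactly the kind of assumption the clusters would test.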

The trick might be far simpler: you have to look at the transfers that carry a URL in the memo, and if an upvote to that URL then comes back in return, you have a bot at work. Of course I have no idea how to filter that reliably or squeeze it into R or anything else. ;-)

The real problem with doing it that way is trying to actually sort through that much data, because you have to have both all of the transfers and memos and all of the votes in order to possibly have a positive hit.

If these bot designers were smart, they would start requiring that the memo be sent with an encrypted hashtag at the beginning so that casual observation couldn't make out what the targeted URL is from outside the recipient. Some of them may be doing that; that's outside of my personal experience.

That is a lot of data to be slinging around the network, which is the problem I've been running into a lot lately. It might be possible, but it's definitely not a simple trick.

You caught me there, what I would do is probably too simplistic:

  • you need the list with the transfer-amount+URL+timestamp1 to the bot
  • plus the list with the bot upvote-percentage+URL+timestamp2
  • then you create a table with the columns for URL, timestamp1, transfer-amount, upvote-percentage
  • then you fill the table with the first list
  • and after that you update the table by adding the 2nd list where the URL is the same and timestamp2 > timestamp1, because the transfer comes before the upvote
  • finally you delete all rows that have empty cells.

Done. But again: This is the approach of a lousy SQL amateur;-)
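The steps above can be sketched in a few lines of Python without a database. This is only an illustration under assumptions: both lists already exist as dictionaries, the keys `url`, `timestamp`, `amount` and `percent` are names invented here, and timestamps are ISO 8601 strings in one consistent format so that they compare correctly as text. When several upvotes follow a transfer, the sketch keeps the earliest one, which is a choice of this example and not something stated in the thread.

```python
def match_paid_upvotes(transfers, upvotes):
    """Pair each transfer with the first later upvote on the same URL.

    transfers: dicts with "url", "timestamp", "amount"
    upvotes:   dicts with "url", "timestamp", "percent"
    Transfers without a later upvote are dropped, which plays the role
    of "delete all rows that have empty cells".
    """
    # Index the bot's upvotes by URL for quick lookup.
    upvotes_by_url = {}
    for vote in upvotes:
        upvotes_by_url.setdefault(vote["url"], []).append(vote)

    matched = []
    for transfer in transfers:
        later = [
            vote for vote in upvotes_by_url.get(transfer["url"], [])
            if vote["timestamp"] > transfer["timestamp"]
        ]
        if not later:
            continue  # no upvote followed this transfer
        first = min(later, key=lambda vote: vote["timestamp"])
        matched.append({
            "url": transfer["url"],
            "timestamp1": transfer["timestamp"],
            "transfer_amount": transfer["amount"],
            "timestamp2": first["timestamp"],
            "upvote_percent": first["percent"],
        })
    return matched
```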

See, the problem is not this process, which is fairly straightforward. The problem is generating the list of transfer amounts and URLs along with the list of bot upvote percentages and URLs. In order to generate those lists in the first place, you have to do a fair amount of ugly digging and parsing.

It's that part that's really the issue. Figuring out what the signs of those things are and extracting them.
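As one small piece of that extraction, here is a hedged sketch of pulling a post reference out of a transfer memo. It assumes the memo contains a link with an `@author/permlink` segment, that account names use lowercase letters, digits, dots and hyphens, and that permlinks use lowercase letters, digits and hyphens. It also assumes a memo starting with `#` is encrypted and cannot be read from outside. The name `post_from_memo` is made up for this example, and real memos will be messier than this.

```python
import re

# Assumed shape of a post reference inside a memo: "@author/permlink".
POST_REF = re.compile(r"@([a-z0-9][a-z0-9.\-]*)/([a-z0-9\-]+)")


def post_from_memo(memo):
    """Return (author, permlink) if the memo appears to point at a post,
    otherwise None."""
    memo = memo.strip()
    if memo.startswith("#"):
        return None  # assumed to be an encrypted memo
    match = POST_REF.search(memo)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(post_from_memo("https://steemit.com/utopian-io/@alice/my-post-title"))
print(post_from_memo("thanks for the coffee"))
```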

And then you have to do it for every single bot, which means that you have a fair number of transactions that are going to have to be hitting the server in order to straighten everything out.

It's a lot of data. And ultimately – I'm not sure that it really tells us anything that we don't already know.

It might actually be more efficient to simply query the lot of all posts made over the last week and have them give their active_votes attribute up and do all of the parsing on that. If nothing else it keeps the query simple.
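A rough sketch of that parsing step, assuming the posts have already been fetched into Python dictionaries and that each entry in `active_votes` carries a `voter` name and an `rshares` value (check these field names against whatever API you actually query). The bot list is something you would have to supply yourself, and `rshares_by_voter_group` is a hypothetical name. Summed rshares are only a stand-in for each group's share of the payout, not the payout itself.

```python
def rshares_by_voter_group(posts, bot_accounts):
    """Sum vote rshares per group ("bot" or "organic") across posts.

    posts:        iterable of dicts, each with an "active_votes" list
    bot_accounts: set of account names to treat as bid bots
    """
    totals = {"bot": 0, "organic": 0}
    for post in posts:
        for vote in post.get("active_votes", []):
            group = "bot" if vote["voter"] in bot_accounts else "organic"
            # rshares may arrive as a string or a number; int() accepts both.
            totals[group] += int(vote["rshares"])
    return totals
```

The fetching itself stays a single simple query per batch of posts, and everything bot-specific lives in the `bot_accounts` set rather than in per-bot requests.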