RE: *TrufflePig*: Introducing the Artificial Intelligence for Content Curation and Minnow Support

in #steemit 6 years ago

Looks like a cool project, @smcaterpillar. Looking forward to reading more about it!
May I share some thoughts that come to mind?
Have you removed common words like he, she, the, a ...? In the topics you found, it seems they are still in. Also, HTML tags like "href" are in there. Removing them could already improve your features.
Have you done, e.g., a 5-fold cross-validation of the training and test set? This often gives a more realistic view of the predicted results.
Did you have a look at the far outliers in the prediction and review them manually? I think this could give interesting insights to improve your features.

Thanks for sharing :)

Hi, these are some good remarks, thank you. Let me address them one by one:

"Have you removed common words like he, she, the, a ...? In the topics you found, it seems they are still in."
Yes, but rather arbitrarily. I just filtered out any word that appears in more than one third of the training set posts. Apparently this has left "she" and "he" in there, but at least it removed "a" and "the". I should try lowering the threshold, maybe to 10 or 20% of all documents. It is definitely worth trying to find a sweet spot via cross-validation.
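
As a rough illustration of the document-frequency filter described here (a sketch, not the actual TrufflePig code; the function name `filter_frequent_words`, the parameter `max_doc_fraction`, and the toy posts are all made up for this example):

```python
from collections import Counter

def filter_frequent_words(tokenized_posts, max_doc_fraction=1 / 3):
    """Drop every word that occurs in more than max_doc_fraction of the posts."""
    n_posts = len(tokenized_posts)
    # Count in how many posts each word occurs (not how often in total).
    doc_freq = Counter()
    for tokens in tokenized_posts:
        doc_freq.update(set(tokens))
    too_common = {w for w, df in doc_freq.items() if df / n_posts > max_doc_fraction}
    filtered = [[t for t in tokens if t not in too_common] for tokens in tokenized_posts]
    return filtered, too_common

# Hypothetical toy corpus: each post is already split into lowercase tokens.
posts = [
    "the pig found a truffle".split(),
    "she wrote the post".split(),
    "he liked the story".split(),
    "a bot curates the content".split(),
]
filtered, removed = filter_frequent_words(posts, max_doc_fraction=1 / 3)
print(sorted(removed))
```

In this toy corpus "the" and "a" exceed the one-third threshold while "she" and "he" each occur in only one of four posts, so they survive, which mirrors the effect described above. Lowering `max_doc_fraction` makes the filter more aggressive.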

"Also, HTML tags like 'href' are in there. Removing them could already improve your features."
Yes, damn. I wrote a bunch of regular expression filters, but I missed href. A fix will be included in the next version.
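
A minimal sketch of such a regular expression filter, assuming the goal is to drop whole tags (attributes like href included) and bare URLs before tokenizing. The patterns and the helper name `strip_markup` are illustrative, not the filters actually used in the project:

```python
import re

# Matches anything that looks like an HTML tag, including its attributes.
TAG_RE = re.compile(r"<[^>]+>")
# Matches bare URLs that would otherwise leave tokens like "https" behind.
URL_RE = re.compile(r"https?://\S+")

def strip_markup(text):
    """Remove HTML tags and bare URLs, then collapse whitespace."""
    # Tags go first: a URL inside an attribute is removed together with its tag,
    # so the URL pattern cannot swallow text that directly follows the tag.
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())

sample = 'Read <a href="https://example.com/post">my post</a> and <b>vote</b>!'
cleaned = strip_markup(sample)
print(cleaned)
```

A regex like this is only a heuristic; malformed markup or a literal ">" inside an attribute value would need a real HTML parser.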

"Have you done, e.g., a 5-fold cross-validation of the training and test set? This often gives a more realistic view of the predicted results."
I haven't done any cross-validation yet, but I definitely will, to tune hyperparameters such as the number of topics or the word filter threshold. I have to see whether it makes sense to also tune some forest parameters like max_depth, max_leaf_nodes, or the percentage of features considered at each split. What I have done, though, is run the model a couple of times with different RNG seeds to check whether the results are consistent and robust (they are).
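
For reference, k-fold cross-validation can be sketched in plain Python as below. Everything here is a stand-in: `k_fold_indices`, `cross_validate`, the mean-predicting toy model, and the toy data are assumptions for illustration, and the real pipeline would plug in the topic model and the forest instead:

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs so every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_validate(fit, score, X, y, k=5, seed=0):
    """Return one score per fold for a model trained by fit and rated by score."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in test_idx], [y[i] for i in test_idx]))
    return scores

# Toy stand-in for the real model: always predict the mean training payout.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def mean_abs_error(model, X_test, y_test):
    return sum(abs(model - target) for target in y_test) / len(y_test)

X = list(range(20))
y = [float(v % 4) for v in X]
fold_scores = cross_validate(fit_mean, mean_abs_error, X, y, k=5)
print(fold_scores)
```

To tune a hyperparameter such as the word filter threshold, one would repeat `cross_validate` for each candidate value and compare the average fold score.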

"Did you have a look at the far outliers in the prediction and review them manually? I think this could give interesting insights to improve your features."
I haven't done a very thorough investigation yet. However, the truffles you see in the post above are, by definition, outliers: they have the largest difference between real and predicted payout.
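
A small sketch of how such outliers could be pulled out for manual review, assuming each post carries its real and its predicted payout. The field names, the helper `rank_by_residual`, and the numbers are invented for illustration, and ranking by the absolute gap is one possible choice, not necessarily the one used for the truffles:

```python
def rank_by_residual(posts, top_n=3):
    """Sort posts by |predicted - actual| payout, largest gap first."""
    ranked = sorted(posts, key=lambda p: abs(p["predicted"] - p["actual"]), reverse=True)
    return ranked[:top_n]

# Hypothetical payouts, purely for demonstration.
posts = [
    {"title": "post A", "actual": 0.5, "predicted": 12.0},
    {"title": "post B", "actual": 30.0, "predicted": 28.0},
    {"title": "post C", "actual": 2.0, "predicted": 2.5},
    {"title": "post D", "actual": 40.0, "predicted": 5.0},
]
for post in rank_by_residual(posts, top_n=2):
    gap = post["predicted"] - post["actual"]
    print(f'{post["title"]}: predicted - actual = {gap:+.2f}')
```

A positive gap marks a post the model rates higher than its payout, a negative gap a post that earned more than predicted; both kinds are worth reading by hand.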

Thanks for the feedback, really appreciated!