EDA Data Predictive Analysis
Predictive Analysis (EDA) using the 4-plot test
https://github.com/jean-francoisgiraud/Probabilistic-Predictability
Objective: this method can be used to determine whether data can support predictions or conclusions, and to what degree of confidence.
Uses: STEM, quality management, decision making, finance, economics, cryptocurrency investments.
It uses data statistics instead of heuristics (a heuristic is an approach to problem solving, learning, or discovery that employs a practical method not guaranteed to be optimal or perfect, but sufficient for the immediate goals).
The Exploratory Data Analysis approach does not impose deterministic or probabilistic models on the data. Instead, the EDA approach allows the data to suggest admissible models that best fit it.
IF the four assumptions are true: (1) random sampling from (2) a fixed distribution with (3) a fixed location [mean] and (4) a fixed variation [standard deviation]
THEN the data is parametric [it has a known distribution] AND
THEN probabilistic predictability is achieved: the process is in statistical control and repeatable, and it can be modeled [Yi = C + Ei, i.e. data = average + error] to make scientifically and legally valid predictions and conclusions [based on probability]; for example, the data will be Y +/- error 19 out of 20 times (~95%). Control limits can be established and outliers identified and rejected; otherwise all data is significant, and non-conforming results cannot be rejected arbitrarily: they must be rejected on the basis of data from a statistically controlled process (a minimal sketch of this model follows after this block).
ELSE the process is unpredictable, out of control, and drifting; no conclusions or judgements [about the past] or predictions [about the future] can be made, and repeating the tests will yield different and unrelated results [there could be other unknown and unknowable variables not accounted for]
IF the sample is not randomly selected THEN it is biased and not representative of the population it claims to represent AND no decisions can be made from it
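As an illustration of the Yi = C + Ei model and the "19 out of 20" limits above, here is a minimal sketch in Python (the data is synthetic and the helper function is hypothetical, not taken from the linked repository):

```python
# Minimal sketch: estimate Yi = C + Ei and set ~2-sigma control limits
# (about 95%, i.e. "19 out of 20"), assuming the four assumptions hold.
import numpy as np

def control_limits(y, k=2.0):
    """Return (center, lower, upper) limits at k standard deviations."""
    c = np.mean(y)            # C: the fixed location (mean)
    s = np.std(y, ddof=1)     # spread of the error term Ei
    return c, c - k * s, c + k * s

rng = np.random.default_rng(1)
y = rng.normal(50.0, 5.0, size=200)  # stand-in for in-control process data

center, lcl, ucl = control_limits(y)
outliers = y[(y < lcl) | (y > ucl)]  # points outside the ~95% band
print(f"Y = {center:.2f} +/- {center - lcl:.2f}, 19 out of 20 times")
print(f"{outliers.size} of {y.size} points fall outside the limits")
```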
This 'unpredictable' domain is where statistics are invalid and misleading [they create a false sense of security]; it is Taleb's 'black swan' or 'turkey in November' territory, where anything can happen and the dramatic and unexpected does happen (much of human experience and discourse, e.g. politics and values, lies in this undebatable, unpredictable domain).
An unknown or changing distribution of a complex process makes accurate and exact predictions impossible.
Only take direct action on the process or operator for special causes of variation [they could be meaningful signals]. To reduce variation due to common causes [noise], action must be taken on the system by management.
Test the assumptions with a 4-plot [run sequence plot, lag plot, histogram, normal probability plot]; if they hold, develop a model for the system. The objective is to characterize and model the function [regression, forecast function].
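A minimal 4-plot sketch in Python, assuming NumPy, Matplotlib, and SciPy (the synthetic data is illustrative only; this is not necessarily how the linked repository builds its plots):

```python
# Minimal 4-plot sketch: run sequence, lag plot, histogram,
# and normal probability plot, on synthetic illustrative data.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=500)  # stand-in for real measurements

fig, ax = plt.subplots(2, 2, figsize=(8, 8))

# 1. Run sequence plot: checks fixed location and fixed variation over time.
ax[0, 0].plot(y, marker=".", linestyle="-", linewidth=0.5)
ax[0, 0].set_title("Run sequence plot")

# 2. Lag plot (y[i] vs y[i-1]): visible structure indicates non-randomness.
ax[0, 1].scatter(y[:-1], y[1:], s=5)
ax[0, 1].set_title("Lag plot")

# 3. Histogram: suggests the shape of the underlying distribution.
ax[1, 0].hist(y, bins=30)
ax[1, 0].set_title("Histogram")

# 4. Normal probability plot: points near a straight line support normality.
stats.probplot(y, dist="norm", plot=ax[1, 1])
ax[1, 1].set_title("Normal probability plot")

plt.tight_layout()
plt.show()
```

If all four panels look clean (flat run sequence, structureless lag plot, bell-shaped histogram, straight probability plot), the four assumptions are plausible and modeling can proceed.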
Signal-to-noise (SN) ratio vs time period (T) in cryptocurrencies and stocks:
The longer the period the data covers, the better the signal-to-noise ratio (a rough sketch follows the examples below).
For example, in stock markets and cryptocurrencies (personal observations and approximations):
- For T = 1 year, SN = 1/1.
- For T = 1 day, SN = 0.05/0.95.
- For T = 1 hour, SN = 0.005/0.995.
- Daily or hourly 'news' is at best irrelevant and at worst misleading more than 95% of the time (~2 sigmas, i.e. standard deviations).
- These are just my observations from publicly available data (not financial advice)
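The ratios above are personal approximations, not measurements. As one rough way to make the idea concrete, the sketch below defines "signal" as the net move explained by a linear trend and "noise" as the residual standard deviation around it, on a synthetic price series; both the definition and the parameters are assumptions for illustration:

```python
# Rough illustration only: "signal" = net move explained by a linear trend,
# "noise" = standard deviation of the residuals around that trend.
# Synthetic hourly prices; all definitions and parameters are assumptions.
import numpy as np

rng = np.random.default_rng(2)
hours = 24 * 365
price = 100 + 0.01 * np.arange(hours) + np.cumsum(rng.normal(0, 0.5, hours))

def signal_to_noise(series):
    t = np.arange(series.size)
    slope, intercept = np.polyfit(t, series, 1)  # long-run linear trend
    trend = slope * t + intercept
    signal = abs(trend[-1] - trend[0])           # move explained by the trend
    noise = np.std(series - trend, ddof=1)       # scatter around the trend
    return signal / noise

for label, window in [("1 year", hours), ("1 day", 24)]:
    print(f"T = {label}: signal/noise ~ {signal_to_noise(price[-window:]):.2f}")
```

Longer windows give the trend more room to dominate the hour-to-hour noise, which is the pattern the figures above describe.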
REF (1):
Predictability is an all-important goal in science and engineering. If the four underlying assumptions hold, then we have achieved probabilistic predictability--the ability to make probability statements not only about the process in the past, but also about the process in the future. In short, such processes are said to be "in statistical control".
If the four assumptions are valid, then the process is amenable to the generation of valid scientific and engineering conclusions. If the four assumptions are not valid, then the process is drifting (with respect to location, variation, or distribution), unpredictable, and out of control. A simple characterization of such processes by a location estimate, a variation estimate, or a distribution estimate inevitably leads to engineering conclusions that are not valid, are not supportable (scientifically or legally), and which are not repeatable in the laboratory.
Because the validity of the final scientific/engineering conclusions is inextricably linked to the validity of the underlying univariate assumptions, it naturally follows that there is a real necessity that each and every one of the above four assumptions be routinely tested.
Extreme data points are not necessarily due to a special cause [especially if the process is not in statistical control]; they can still be part of the normal variation of the process.
REF (1): NIST/SEMATECH e-Handbook of Statistical Methods (Engineering Statistics Handbook)
https://www.itl.nist.gov/div898/handbook/