Data modeling, restructuring, analysis, fuzzy cases, learning
This is more for overview of my own than for teaching or exercise.
This article/section is a stub: probably a pile of half-sorted notes, not well-checked, so it may have incorrect bits. (Feel free to ignore, fix, or tell me)
- 1 Concepts & glossary
- 1.1 Stochastic processes, deterministic processes, random fields
- 1.2 Types of problems
- 1.3 Markov property
- 1.4 Observability and controllability
- 1.5 Underfitting and overfitting (learners)
- 1.6 supervised versus unsupervised systems (learners)
- 1.7 Inductive versus deductive systems (learners)
- 1.8 Model-based versus model-free systems (learners)
- 2 Statistical modeling
- 3 Classifiers
- 4 Semi-sorted
- 5 Unsorted
Concepts & glossary
The curse of dimensionality is, roughly, the idea that each dimension you add to your model multiplies the amount of data you need for decent training of that model. Similarly, since the volume of the space grows so fast with each added dimension, the data you do have probably ends up sparse. The growth is roughly exponential in the number of dimensions.
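A rough numeric sketch of that growth, assuming you want a fixed resolution of 10 bins per dimension and have a fixed budget of 1000 samples (both numbers, and the function name, are arbitrary choices for illustration):

```python
def cells_and_density(dimensions, bins_per_dim=10, samples=1000):
    """Return (number of grid cells, average samples per cell)."""
    cells = bins_per_dim ** dimensions
    return cells, samples / cells


# The same 1000 samples get spread ever thinner as dimensions are added.
for d in (1, 2, 3, 6):
    cells, density = cells_and_density(d)
    print(f"{d} dimensions: {cells} cells, {density:g} samples per cell")
```

With these numbers the data goes from 100 samples per cell in one dimension to one sample per cell in three, and to far less than one per cell after that.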
Stochastic processes, deterministic processes, random fields
A deterministic process deals only with fully determined cases: no unknowns or random variables, so the same starting state always leads to the same outcome.
A stochastic process (a.k.a. random process) allows indeterminacy, typically by working with probability distributions.
A lot of data is stochastically modeled, because you only need partial data and can generally only get partial data.
(Models mixing deterministic and stochastic processes are often called hybrid models)
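To make the difference concrete, here is a minimal sketch with one process of each kind. The update rule and the random walk are just illustrative choices, not anything specific to a particular modeling method:

```python
import random


def deterministic_process(x0, steps):
    """x[t+1] = 0.5 * x[t] + 1: the same start always gives the same path."""
    path = [x0]
    for _ in range(steps):
        path.append(0.5 * path[-1] + 1)
    return path


def random_walk(x0, steps, seed=None):
    """x[t+1] = x[t] + e, with e drawn uniformly from {-1, +1}."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        path.append(path[-1] + rng.choice((-1, 1)))
    return path


print(deterministic_process(0.0, 3))  # identical on every run
print(random_walk(0, 10))             # usually differs from run to run
print(random_walk(0, 10))
```

You can describe the random walk only in terms of the distribution over paths, which is the sense in which stochastic models work with probability distributions.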
A random field basically describes the generalization that happens when the index parameter (the independent variable, time in an ordinary stochastic process) is not necessarily time, or one-dimensional, or real-valued.
Types of problems
Tasks are often one of:
- Clustering points out regions or groups of (mutual) similarity, and dissimilarity from other groups.
- clustering may not deal well with future data of the same sort, unless some care has been taken, so may not be the best choice of a learning/predicting system
- Vector quantization: Discretely dividing continuous space into various areas/shapes
- which itself can be used for decision problems, labeling, etc.
- Dimensionality reduction: projecting attributes into lower-dimensional data
- where the resulting data is (hopefully) comparably predictive/correlative (compared to the original)
- The reason is often to eliminate attributes/data that may be irrelevant or too informationally sparse
- see also: Ordination, Dimensionality reduction, Factor Analysis, Multivariate analysis
- Feature extraction: discovering (a few nice) attributes from (many) old ones, or just from data in general.
- Often has a good amount of overlap with dimensionality reduction
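As an illustration of the first two task types, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional points, followed by using the resulting centroids to quantize new values. The data, the starting centroids, and the function names are made up for the example; real use would need a better initialization and a convergence check.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its group
        # (an empty group keeps its old centroid).
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def quantize(value, centroids):
    """Vector quantization: map a value to the index of its nearest centroid."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids = kmeans_1d(points, centroids=[0.0, 6.0])
print(centroids)
print(quantize(1.1, centroids), quantize(4.0, centroids))
```

The `quantize` step is the "care" needed to handle future data: the clustering itself only describes the points it was given.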
Markov property
The Markov property is essentially that there is no memory, only direct response: the response of a process is determined entirely by its current state (and input, if you don't already define that as part of the state).
More formally, "The environment's response (s,r) at time t+1 depends only on the Markov state s and action a at time t" 
There are many general concepts that you can make stateless, and thereby Markovian:
- A Markov model is a stochastic model with the Markov property
- A Markov chain refers to a Markov process with a discrete (finite or countable) set of states
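- A Markov random field is an undirected graphical model in which each variable, given its neighbours, is conditionally independent of all other variables
- A Markov logic network applies the ideas of a Markov random field to first-order logic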
- A Markov Decision Process (MDP) is a decision process that satisfies the Markov property
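A Markov chain can be sketched as a table of transition probabilities plus a sampling step that looks only at the current state. The two weather states and their probabilities below are made up for illustration:

```python
import random

# Hypothetical two-state weather chain; the probabilities are invented.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(state, rng):
    """Sample the next state using only the current state (Markov property)."""
    options = TRANSITIONS[state]
    return rng.choices(list(options), weights=list(options.values()))[0]


def simulate(start, steps, seed=None):
    """Return a path of steps + 1 states, beginning at start."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


print(simulate("sunny", 10, seed=0))
```

Note that `next_state` never sees the earlier part of the path, which is exactly the "no memory" condition.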
Observability and controllability
Underfitting and overfitting (learners)
Underfitting is when a model is too simple to be good at describing all the patterns in the data.
Underfitted models and learners may still generalize very well, and that can be intentional, e.g. to describe just the most major patterns.
It may be hard to quantify how crude is too crude, though.
Overfitting often means the model is allowed to be so complex that, once part of it describes all the real patterns there are, the rest ends up describing just noise: insignificant variance or random errors in the training set.
A little overfitting is not disruptive, but a lot of it often is, distorting or drowning out the parts that are actually modeling the major relationships.
Put another way, overfitting is the (mistaken) assumption that convergence on the training data means convergence on all data.
There are a few useful tests to evaluate overfitting and underfitting.
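One common check is to compare the error on the training data with the error on held-out data. The sketch below does that for a straight-line fit and for a "memorizer" that just returns the target of the nearest training point. The data is invented (a true relationship y = 2x with alternating +1/-1 noise on the training targets), and the function names are only for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def memorize(xs, ys):
    """Maximally flexible model: return the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of model over the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# True relationship y = 2x; training targets carry alternating +1/-1 "noise".
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
# Held-out points lie between the training points and are noise-free.
test_x = [x + 0.25 for x in train_x]
test_y = [2 * x for x in test_x]

for name, model in (("line", fit_line(train_x, train_y)),
                    ("memorizer", memorize(train_x, train_y))):
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The memorizer has zero training error but a clearly larger held-out error than the line, because it has faithfully learned the noise. The line has the worse training error and the better held-out error.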
supervised versus unsupervised systems (learners)
Supervised usually means the training process is guided, or its results are somehow (dis)approved. Usually it refers to having annotated training data, sometimes to altering it. Examples: classification of documents, basic neural network back-propagation, least-squares fitting, operator cloning
Unsupervised refers to processes that work without intervention. For example, self-organizing maps, or clustering documents based on similarity, need no annotation.
Semi/lightly supervised usually means there is an iterative process which needs only minimal human intervention, be it to deal with a few unclear cases, for information that may be useful, or such.
Inductive versus deductive systems (learners)
Inductive refers to training purely from data.
Deductive refers to also having a theory about the domain.
Inductive learning can be approached as a search in a hypothesis space.
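A toy sketch of that view: the hypothesis space here is every rule of the form "x >= t" for an integer threshold t, and learning is just keeping the hypotheses that agree with all labelled examples. The data and the shape of the hypothesis space are assumptions made up for the example.

```python
def consistent_thresholds(examples, candidates):
    """Return every threshold t for which 'x >= t' labels all examples correctly."""
    return [t for t in candidates
            if all((x >= t) == label for x, label in examples)]


# Hypothetical labelled data: (value, is_positive).
examples = [(1, False), (2, False), (5, True), (7, True)]
# The hypothesis space: one rule "x >= t" per integer threshold t.
hypothesis_space = range(0, 10)

print(consistent_thresholds(examples, hypothesis_space))  # [3, 4, 5]
```

Several hypotheses survive, which shows that data alone often does not pick a single answer; choosing among them takes some extra preference or more data.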
Model-based versus model-free systems (learners)
Data fusion means merging data from various sources, e.g. various sensors. This often implies some sort of modelling.
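As one simple example of such a model, two noisy readings of the same quantity can be combined by inverse-variance weighting. This sketch assumes the sensors are independent and unbiased and that their variances are known. The readings are made up, and this is only one of many fusion approaches.

```python
def fuse(readings):
    """Fuse (value, variance) pairs by inverse-variance weighting.

    Assumes independent, unbiased measurements of the same quantity.
    Returns (fused value, fused variance).
    """
    weights = [1.0 / variance for _, variance in readings]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, readings)) / total
    return value, 1.0 / total


# Hypothetical readings: a precise sensor (variance 1) and a noisy one (variance 4).
print(fuse([(10.0, 1.0), (14.0, 4.0)]))
```

The fused value lands closer to the more precise sensor, and the fused variance is smaller than either input variance.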
- linear regression models a linear predictor function
- typically a least squares estimator
- simple regression indicates a single independent/explanatory variable
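For the simple-regression case the least-squares estimator has a closed form. Writing the model as y ≈ α + βx over n observations (the symbols are my own choice here, with bars denoting sample means):

```latex
\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
```

These are the values of α and β that minimize the sum of squared residuals between the observed y and the predicted α + βx.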