# Data modeling, restructuring, analysis, fuzzy cases, learning


 This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

## Concepts & glossary

The curse of dimensionality is, roughly, the idea that each dimension you add to your model requires disproportionately (roughly exponentially) more data for decent training of that model. Similarly, since the volume of the space grows so fast with each added dimension, the data you have tends to become sparse.
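As a rough illustration of that growth, here is a minimal sketch that assumes you cover the space with a regular grid, a fixed number of bins per axis. The function names and the grid assumption are mine, purely for illustration.

```python
def cells_needed(bins_per_axis: int, dimensions: int) -> int:
    """Number of grid cells when every axis is cut into the same number of bins."""
    return bins_per_axis ** dimensions


def expected_points_per_cell(n_points: int, bins_per_axis: int, dimensions: int) -> float:
    """Average number of samples per cell if the points were spread evenly."""
    return n_points / cells_needed(bins_per_axis, dimensions)


if __name__ == "__main__":
    # Same million points, same 10 bins per axis, increasing dimensionality.
    for d in (1, 2, 3, 5, 10):
        print(d, cells_needed(10, d), expected_points_per_cell(1_000_000, 10, d))
```

With a fixed amount of data, the average occupancy per cell drops by a factor of the bin count for every dimension added, which is the sparsity problem in its simplest form.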

### Stochastic processes, deterministic processes, random fields

A deterministic process deals only with fully determined cases: there are no unknowns or random variables, so the same starting state always leads to the same outcome.

A stochastic process (a.k.a. random process) allows indeterminacy, typically by working with probability distributions.

A lot of data is modeled stochastically, because you often need only partial data, and can generally only get partial data.

(Models mixing deterministic and stochastic processes are often called hybrid models)
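To make the contrast concrete, here is a small sketch, not tied to any particular library or model. The two processes (a geometric decay and a Gaussian random walk) and all names are assumptions chosen for illustration.

```python
import random


def deterministic_decay(x0, rate, steps):
    """Each value is fully determined by the previous one: x[t+1] = rate * x[t]."""
    xs = [x0]
    for _ in range(steps):
        xs.append(rate * xs[-1])
    return xs


def random_walk(x0, step_sd, steps, rng):
    """Each value is the previous one plus Gaussian noise.

    A single run is just one draw from a whole distribution of possible paths.
    """
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + rng.gauss(0.0, step_sd))
    return xs


if __name__ == "__main__":
    print(deterministic_decay(8.0, 0.5, 3))            # identical on every run
    print(random_walk(0.0, 1.0, 3, random.Random()))   # differs from run to run
```

Passing in a seeded `random.Random` makes a stochastic run repeatable, which is handy for testing, but the model itself still describes a distribution rather than a single trajectory.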

A random field is basically the generalization you get when the index parameter (the independent variable) is not necessarily time, or one-dimensional, or real-valued.

### Types of problems

• Clustering: pointing out regions or groups of (mutual) similarity, and dissimilarity from other groups.
  Clustering may not deal well with future data of the same sort unless some care has been taken, so it may not be the best choice for a learning/predicting system.
• Vector quantization: discretely dividing continuous space into various areas/shapes,
  which itself can be used for decision problems, labeling, etc.
• Dimensionality reduction: projecting attributes into lower-dimensional data,
  where the resulting data is (hopefully) comparably predictive/correlative (compared to the original).
  The reason is often to eliminate attributes/data that may be irrelevant or too informationally sparse.
• Feature extraction: discovering (a few nice) attributes from (many) old ones, or just from data in general.
  Often has a good amount of overlap with dimensionality reduction.
• others...
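As one example from the list above, vector quantization can be sketched as assigning each point to its nearest codebook vector. The codebook here is made up, and nearest-neighbor assignment under Euclidean distance is just one common way to divide the space, assumed for illustration.

```python
import math


def quantize(point, codebook):
    """Return the index of the codebook vector closest to point (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(point, codebook[i]))


# Hypothetical codebook: three representative vectors in 2D.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

# Each continuous point is replaced by a discrete label (an index into the codebook).
labels = [quantize(p, codebook) for p in points]
print(labels)
```

The resulting index can be used directly as a label or as input to a decision rule.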

### Markov property

The Markov property is essentially that there is no memory, only direct response: the response of a process is determined entirely by its current state (and input, if you don't already define that as part of the state).

More formally, "The environment's response (s,r) at time t+1 depends only on the Markov state s and action a at time t" 

There are many general concepts that you can make stateless, and thereby Markovian:

• A Markov chain refers to a Markov process with a discrete (finite or countably infinite) set of states
• A Markov random field 
• A Markov logic network 
• A Markov Decision Process (MDP) is a decision process that satisfies the Markov property
• ..etc.
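A minimal sketch of a Markov chain, assuming a made-up two-state weather model (the states and probabilities are hypothetical). The point to notice is that sampling the next state consults only the current state, never the earlier path.

```python
import random

# Hypothetical transition probabilities, for illustration only.
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def next_state(state, rng):
    """Sample the next state; only the current state is consulted (Markov property)."""
    options = TRANSITIONS[state]
    return rng.choices(list(options), weights=list(options.values()))[0]


def simulate(start, steps, rng):
    """Generate one random path of the chain, including the start state."""
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


def step_distribution(dist):
    """Push a probability distribution over states one step forward."""
    out = {s: 0.0 for s in TRANSITIONS}
    for s, p in dist.items():
        for t, q in TRANSITIONS[s].items():
            out[t] += p * q
    return out


if __name__ == "__main__":
    print(simulate("sunny", 10, random.Random()))
    print(step_distribution({"sunny": 1.0, "rainy": 0.0}))
```

`simulate` draws one concrete path, while `step_distribution` tracks how the probability of being in each state evolves. Applying it repeatedly approaches the chain's long-run distribution for a chain like this one.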

### Underfitting and overfitting (learners)

Underfitting is when a model is too simple to be good at describing all the patterns in the data.

Underfitted models and learners may still generalize very well, and that can be intentional, e.g. to describe just the most major patterns.

It may be hard to quantify how crude is too crude, though.

Overfitting often means the model is allowed to be so complex that part of it is enough to describe all the real patterns there are, so the rest ends up describing just noise: insignificant variance or random errors in the training set.

A little overfitting is not disruptive, but a lot of it often is, distorting or drowning out the parts that are actually modeling the major relationships.

Put another way, overfitting is the (mistaken) assumption that convergence on the training data means convergence on all data.

There are a few useful tests to evaluate overfitting and underfitting.
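One simple such test is comparing error on the training data with error on held-out data. The sketch below assumes a made-up ground truth (y = 2x + 1 plus Gaussian noise) and three deliberately chosen toy models. None of this is a standard benchmark, just an illustration.

```python
import random


def make_data(xs, rng, noise_sd=1.0):
    """Hypothetical ground truth y = 2x + 1 with Gaussian noise added."""
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, noise_sd)) for x in xs]


def mse(model, data):
    """Mean squared error of a model (a function x -> y) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def fit_mean(train):
    """Too simple: ignores x entirely, so it underfits."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Least-squares straight line: matches the shape of the true relationship."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_memorizer(train):
    """Too flexible: returns the y of the nearest training x, noise included."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


rng = random.Random(0)
xs = [float(i) for i in range(20)]
train = make_data(xs, rng)
test = make_data(xs, rng)  # same x values, fresh noise

for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(f"{name:10s} train MSE {mse(model, train):8.3f}   test MSE {mse(model, test):8.3f}")
```

The memorizer reproduces the training set perfectly (zero training error) but carries the training noise over to new data. The mean predictor is poor on both sets. A large gap between training and test error hints at overfitting, while high error on both hints at underfitting.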

### Supervised versus unsupervised systems (learners)

Supervised usually means the training process is guided, or its results are somehow approved or disapproved. Usually it refers to having annotated training data, sometimes to altering it. Examples: classification of documents, basic neural network back-propagation, least-squares fitting, operator cloning.

Unsupervised refers to processes that work without intervention. For example, self-organizing maps, or clustering documents based on similarity, need no annotation.

Semi/lightly supervised usually means there is an iterative process which needs only minimal human intervention, be it to deal with a few unclear cases, for information that may be useful, or such.

### Inductive versus deductive systems (learners)

Inductive refers to training purely from data.

Deductive refers to also having a theory about the domain.

Inductive learning can be approached as a search in a hypothesis space.
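A toy sketch of that search view, assuming a deliberately tiny hypothesis space: rules of the form "predict True when x >= threshold", with a hand-picked list of candidate thresholds. The data and names are hypothetical.

```python
def learn_threshold(examples, candidates):
    """Pick the candidate threshold with the fewest training errors.

    Each hypothesis is the rule "predict True when x >= threshold";
    learning is an exhaustive search over the candidate thresholds.
    """
    def errors(threshold):
        return sum((x >= threshold) != label for x, label in examples)

    return min(candidates, key=errors)


# Labeled training data: (feature value, label).
examples = [(1.0, False), (2.0, False), (3.5, True), (5.0, True)]
candidates = [0.0, 1.5, 3.0, 4.0, 6.0]

best = learn_threshold(examples, candidates)
print(best)
```

Real learners search far larger (often continuous) hypothesis spaces and use smarter strategies than enumeration, but the structure is the same: a space of candidate hypotheses, plus a criterion computed from data to choose among them.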

## Statistical modeling

Data fusion means merging data from various sources, e.g. various sensors. This often implies some sort of modeling.

### Regression analysis

Regression glossary:

• linear regression models the outcome with a linear predictor function, typically fitted with a least-squares estimator
• simple regression indicates a single independent/explanatory variable
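Combining the two glossary entries, simple linear regression with a least-squares estimator has a closed-form solution. A minimal sketch follows; the function name is my own, and the data is made up so that the points lie exactly on a line.

```python
def simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points lying exactly on y = 2x + 1, so the fit should recover that line.
slope, intercept = simple_linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

The sketch assumes at least two distinct x values; otherwise `sxx` is zero and the slope is undefined.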