### 12.2 Regression LFM’s

**Key points**

- In regression LFM’s (12.35) the factors are observable and loadings are constructed to maximize the r-squared (12.35)-(12.36).
- Regression LFM’s work better when there is high correlation between target variables and factors and low correlation among the factors (12.46).
- The regression prediction (12.44) has a dual geometrical interpretation as best linear prediction (12.67) and as orthogonal projection (12.68) of the target variables onto the linear span of the factors.

The first of the three main classes of dominant-residual LFM’s (12.12) is that of regression LFM’s. In regression LFM’s, the factors are given exogenously and the loadings are constructed.

The purpose of regression models is explanatory: given any number (large, small, or even one) of target variables $\boldsymbol{X}$, the goal is to explain as much as possible of the randomness in $\boldsymbol{X}$ in terms of a few ($\bar{k}$) observable factors $\boldsymbol{Z}$.

Regression LFM’s are also known as “macroeconomic” LFM’s, because in some applications the factors are macroeconomic variables, such as interest rates, stock market returns, etc.

Finally, regression LFM’s naturally generalize to supervised learning models (Sections 14.1-15.1), as shown in Figure 11.1.

#### 12.2.1 Definition

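A **regression**, or **macroeconomic linear factor model** (LFM), for an $\bar{n}$-dimensional target variable $\boldsymbol{X}$ is a dominant-residual decomposition (12.12)

$$\boldsymbol{X} = \boldsymbol{\alpha} + \boldsymbol{\beta}\boldsymbol{Z} + \boldsymbol{U} \tag{12.35}$$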

In (12.35) the target variables $\boldsymbol{X}$ and the factors $\boldsymbol{Z}$ are observable. In particular, we assume that the equivalence class of their joint distribution (12.7) is known, and thus so are their joint expectations and covariances. Then, the loadings matrix $\boldsymbol{\beta}$ is constructed so as to maximize the r-squared of the given factors.

More precisely, let us start with:

i) a symmetric and positive-definite matrix $\boldsymbol{\sigma}^2$ that defines the r-squared objective (12.10);

ii) a number $\bar{k}$ of observable factors $\boldsymbol{Z}$.

Then a regression LFM is a dominant-residual LFM (12.13)

where is the Riccati root of (27.386), and the constraints are:

i) the factors are given exogenously for a suitable vector ;

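ii) the residuals have zero expectation $\mathbb{E}\{\boldsymbol{U}\} = \boldsymbol{0}$, which with i) implies $\boldsymbol{\alpha} = \mathbb{E}\{\boldsymbol{X}\} - \boldsymbol{\beta}\,\mathbb{E}\{\boldsymbol{Z}\}$.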

Therefore, the constraints become

(12.37) |

Then the matrix $\boldsymbol{\beta}$ is optimized in (12.36) to yield the regression loadings $\boldsymbol{\beta}^{Reg}$, which maximize the r-squared.

In the estimation context the dominant-residual regression framework (12.36) becomes the ordinary least squares optimization (12.92).

Regression LFM’s are arguably the most widely implemented models for supervised learning because of their intuitive features (Section 12.6.13). Furthermore, regression LFM’s can be mapped from the affine format (12.35) to the equivalent linear format by considering the constant as an additional factor, see Section 14.1.4.

Example 12.4. **Regression loadings as ordinary least squares**

Consider a univariate target variable $X$ and one observable factor $Z$. Suppose the variables are jointly normal, where

(12.38) |

In Figure 12.4, we show how the regression line, as specified by the regression parameters in (12.36)

(12.39) |

represents, among all admissible lines, the one that best fits the joint distribution, which is represented by a large set of simulations.

Then, we observe how the area corresponding to the squared errors of the regression line (red) is smaller than the area corresponding to the squared errors of an alternative line (blue) for two arbitrary simulations. This is not surprising, since the regression parameters are those that achieve, on average, the least squared errors (12.36).

The regression parameters define the best-fit line in the same way as the expectation and covariance define the best-fit ellipsoid in Example 35.23.

#### 12.2.2 Solution: factor loadings

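Given the constraints (12.37), the key variables to solve for in the regression LFM optimization (12.36) are the loadings, which can be computed analytically E.12.1

$$\boldsymbol{\beta}^{Reg} = \mathbb{C}v\{\boldsymbol{X},\boldsymbol{Z}\}\,(\mathbb{C}v\{\boldsymbol{Z}\})^{-1} \tag{12.40}$$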

Note that the optimal loadings do not depend on the scale matrix $\boldsymbol{\sigma}^2$ specifying the r-squared (12.10).

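Given the optimal loadings (12.40), we obtain the optimal shift from the zero expectation constraint (12.37)

$$\boldsymbol{\alpha}^{Reg} = \mathbb{E}\{\boldsymbol{X}\} - \boldsymbol{\beta}^{Reg}\,\mathbb{E}\{\boldsymbol{Z}\} \tag{12.41}$$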

Note that all that matters to compute the optimal loadings (12.40) and the optimal shift (12.41) are the first two moments of the joint distribution of the target variables and the factors. Hence, any result holds on the whole equivalence class of distributions identified by the first two moments (12.7).
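As a minimal sketch of this remark, the snippet below computes loadings and shift from the first two joint moments only, using the standard closed form (cross-covariance times inverse factor covariance), and cross-checks them against ordinary least squares on simulations. The function name `regression_lfm`, the ordering "targets first, then factors" and the numerical moments are illustrative assumptions; they are not the values of Example 12.5.

```python
import numpy as np

def regression_lfm(mu, sigma2, n_targets):
    """Regression shift and loadings from the first two joint moments.

    mu     : joint expectation of (X, Z), targets first (assumed ordering)
    sigma2 : joint covariance of (X, Z), same ordering
    """
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    n = n_targets
    cv_xz = sigma2[:n, n:]   # cross-covariance of targets and factors
    cv_z = sigma2[n:, n:]    # covariance of the factors
    # loadings = cv_xz @ inv(cv_z), via a linear solve rather than an explicit inverse
    beta = np.linalg.solve(cv_z, cv_xz.T).T
    # shift implied by the zero-expectation constraint on the residuals
    alpha = mu[:n] - beta @ mu[n:]
    return alpha, beta

# hypothetical moments: one target, two factors
mu = np.array([1.0, 0.5, -0.2])
sigma2 = np.array([[1.0, 0.3, 0.4],
                   [0.3, 1.0, 0.2],
                   [0.4, 0.2, 1.0]])
alpha, beta = regression_lfm(mu, sigma2, n_targets=1)

# cross-check: ordinary least squares on simulations from one member of the class
rng = np.random.default_rng(0)
sims = rng.multivariate_normal(mu, sigma2, size=100_000)
design = np.column_stack([np.ones(len(sims)), sims[:, 1:]])
coef, *_ = np.linalg.lstsq(design, sims[:, :1], rcond=None)
alpha_ols, beta_ols = coef[0], coef[1:].T
```

The normal distribution is used here only to generate simulations; any distribution with the same expectation and covariance would lead to the same `alpha` and `beta`.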

Example 12.5. Consider a univariate target variable ($\bar{n}=1$) with two observable factors ($\bar{k}=2$), which are jointly normal, where

(12.42) |

Then the parameter defined by the constraint (12.37) and the optimal loadings (12.40) read S.12.2

(12.43) |

#### 12.2.3 Prediction and fit

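The regression prediction (12.3) becomes

$$\bar{\boldsymbol{X}}^{Reg} = \mathbb{E}\{\boldsymbol{X}\} + \mathbb{C}v\{\boldsymbol{X},\boldsymbol{Z}\}\,(\mathbb{C}v\{\boldsymbol{Z}\})^{-1}\,(\boldsymbol{Z} - \mathbb{E}\{\boldsymbol{Z}\}) \tag{12.44}$$

This equation shows that the regression prediction lives in a $\bar{k}$-dimensional subspace, embedded in the $(\bar{n}+\bar{k})$-dimensional space of the original target variable $\boldsymbol{X}$ and factors $\boldsymbol{Z}$, see Figure 12.5.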

Example 12.6. We continue from Example 12.5. The regression prediction (12.44) is normally distributed , where S.12.2

(12.45) |

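The r-squared (12.10) provided by the prediction (12.44) reads E.12.2

$$R^2 = \frac{\operatorname{tr}\big(\mathbb{C}v\{\boldsymbol{X},\boldsymbol{Z}\}\,(\mathbb{C}v\{\boldsymbol{Z}\})^{-1}\,\mathbb{C}v\{\boldsymbol{Z},\boldsymbol{X}\}\,(\boldsymbol{\sigma}^2)^{-1}\big)}{\operatorname{tr}\big(\mathbb{C}v\{\boldsymbol{X}\}\,(\boldsymbol{\sigma}^2)^{-1}\big)} \tag{12.46}$$

where $\mathbb{C}v\{\boldsymbol{X},\boldsymbol{Z}\}$ is the cross-covariance matrix of target variables and factors, and $\mathbb{C}v\{\boldsymbol{Z}\}$ is the covariance matrix of the factors.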

Since a model that fits the target variable well must have a high r-squared, i.e. close to one, the factors should be as uncorrelated with each other as possible. If there are high correlations among the factors, i.e. high collinearity, the covariance matrix of the factors $\mathbb{C}v\{\boldsymbol{Z}\}$ is ill-conditioned [W] and can become singular. Geometrically, this means that the space of the prediction (12.44) is not properly identified, see Figure 12.5.
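To make the collinearity issue concrete, here is a small numerical sketch. It assumes two unit-variance factors with correlation `rho` and a hypothetical cross-covariance `cv_xz`, and tracks (i) the condition number of the factor covariance and (ii) how much the loadings move when the cross-covariance is perturbed by a tiny, arbitrary amount `bump` (a stand-in for estimation noise).

```python
import numpy as np

def factor_cov(rho):
    """Covariance of two unit-variance factors with correlation rho (illustrative)."""
    return np.array([[1.0, rho], [rho, 1.0]])

def loadings(cv_xz, rho):
    """Loadings cv_xz @ inv(factor covariance), computed via a linear solve."""
    return np.linalg.solve(factor_cov(rho), cv_xz.T).T

cv_xz = np.array([[0.5, 0.5]])     # hypothetical target-factor cross-covariance
bump = np.array([[1e-3, -1e-3]])   # tiny perturbation, e.g. estimation noise

rhos = [0.0, 0.9, 0.999, 0.999999]
# condition number of the factor covariance: equals (1 + rho) / (1 - rho) here
conds = [np.linalg.cond(factor_cov(r)) for r in rhos]
# largest change in the loadings caused by the same tiny perturbation
shifts = [np.abs(loadings(cv_xz + bump, r) - loadings(cv_xz, r)).max() for r in rhos]
```

In this setup both the condition number and the sensitivity of the loadings grow without bound as `rho` approaches one, which is the numerical counterpart of the regression plane not being identified.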

Example 12.7. **Regression linear factor model for different correlations**

Consider a univariate target variable $X$ and two observable factors $\boldsymbol{Z} \equiv (Z_1, Z_2)'$. Suppose the variables are jointly normal, where

(12.47) |

In Figure 12.5 we compare the target variable and the corresponding regression prediction (12.44), which is the projection of the target variable onto the regression plane. More precisely:

-) In the left plot we show the simulations of the target variable and the factors, generated by varying the entries of the correlation matrix, along with the regression plane.

-) In the right plot we show the corresponding projected simulations of the prediction.

Notice that when the factors are collinear, i.e. perfectly correlated, the predicted simulations (green dots) shrink onto a line, which is a one-dimensional subspace of the regression plane (see the right plot). This means that any two-dimensional regression plane passing through that line is admissible (see the left plot), and so the space of the prediction is not properly identified.

Through the explicit expression of the r-squared (12.46) we can determine the best pool of observable factors, see Section C.7.3.1.

Furthermore, the r-squared expression (12.46) supports the intuition that a “good” model must display high overall correlations between the target variable $\boldsymbol{X}$ and the factors $\boldsymbol{Z}$. Indeed, consider a simple case with one univariate target variable $X$ and one factor $Z$ which are jointly normal with correlation $\rho$. If we set $\bar{n} = \bar{k} = 1$ in the r-squared (12.46), we obtain that the r-squared is the squared correlation between the factor and the target variable

$$R^2 = \rho^2 \tag{12.48}$$
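The following sketch evaluates the regression r-squared from the joint covariance alone, as the ratio of explained to total (scaled) variance, and checks the univariate special case. The function name `r_squared`, the block ordering "targets first" and the numbers are assumptions made for illustration.

```python
import numpy as np

def r_squared(sigma2_joint, n_targets, scale2=None):
    """R-squared of the regression prediction from the joint covariance of (X, Z).

    scale2 is the symmetric positive-definite scale matrix; identity if omitted.
    """
    sigma2_joint = np.asarray(sigma2_joint, dtype=float)
    n = n_targets
    cv_x = sigma2_joint[:n, :n]
    cv_xz = sigma2_joint[:n, n:]
    cv_z = sigma2_joint[n:, n:]
    if scale2 is None:
        scale2 = np.eye(n)
    weight = np.linalg.inv(scale2)
    # covariance explained by the factors: cv_xz @ inv(cv_z) @ cv_xz.T
    explained = cv_xz @ np.linalg.solve(cv_z, cv_xz.T)
    return np.trace(explained @ weight) / np.trace(cv_x @ weight)

# univariate case: standard deviations 2 and 3, correlation 0.6 (hypothetical)
rho, sd_x, sd_z = 0.6, 2.0, 3.0
joint_uni = np.array([[sd_x**2, rho * sd_x * sd_z],
                      [rho * sd_x * sd_z, sd_z**2]])
r2_uni = r_squared(joint_uni, n_targets=1)

# one target, two factors (hypothetical covariance)
joint_multi = np.array([[1.0, 0.3, 0.4],
                        [0.3, 1.0, 0.2],
                        [0.4, 0.2, 1.0]])
r2_multi = r_squared(joint_multi, n_targets=1)
```

In the univariate case the result coincides with the squared correlation regardless of the standard deviations chosen.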

Example 12.8. We continue from Example 12.6. The r-squared (12.46) reads S.12.2

(12.49) |

#### 12.2.4 Residuals features

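From the regression prediction (12.44) we can compute the residuals $\boldsymbol{U} \equiv \boldsymbol{X} - \bar{\boldsymbol{X}}^{Reg}$, or

$$\boldsymbol{U} = \boldsymbol{X} - \mathbb{E}\{\boldsymbol{X}\} - \mathbb{C}v\{\boldsymbol{X},\boldsymbol{Z}\}\,(\mathbb{C}v\{\boldsymbol{Z}\})^{-1}\,(\boldsymbol{Z} - \mathbb{E}\{\boldsymbol{Z}\}) \tag{12.50}$$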

The residuals are uncorrelated with the factors E.12.3

$$\mathbb{C}v\{\boldsymbol{U},\boldsymbol{Z}\} = \boldsymbol{0} \tag{12.51}$$

and thus the factors are systematic (12.16). As a matter of fact, the optimal matrix of regression loadings (12.40) is, among all the matrices $\boldsymbol{\beta}$, the only one which makes the residuals $\boldsymbol{U}$ uncorrelated with the factors $\boldsymbol{Z}$. See Section 12.6.5 for more on this profound result.

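Finally, the explicit expression of the covariance of the residuals reads E.12.4

$$\mathbb{C}v\{\boldsymbol{U}\} = \mathbb{C}v\{\boldsymbol{X}\} - \mathbb{C}v\{\boldsymbol{X},\boldsymbol{Z}\}\,(\mathbb{C}v\{\boldsymbol{Z}\})^{-1}\,\mathbb{C}v\{\boldsymbol{Z},\boldsymbol{X}\} \tag{12.52}$$

Hence, in general the residuals are not uncorrelated with each other

$$\mathbb{C}r\{U_m, U_n\} \neq 0, \quad m \neq n \tag{12.53}$$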

and thus they are not idiosyncratic (12.17). We discuss this pitfall further in Section 12.6.5.
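The two residual features can be checked numerically from the moments alone. The sketch below assumes a hypothetical joint covariance for two targets and one factor, and evaluates the residual-factor covariance for the regression loadings and for an arbitrarily shifted loadings matrix, together with the residual covariance.

```python
import numpy as np

# hypothetical joint covariance of (X1, X2, Z): two targets, one factor
sigma2 = np.array([[1.0, 0.6, 0.5],
                   [0.6, 1.0, 0.3],
                   [0.5, 0.3, 1.0]])
n = 2
cv_x, cv_xz, cv_z = sigma2[:n, :n], sigma2[:n, n:], sigma2[n:, n:]

# regression loadings: cv_xz @ inv(cv_z)
beta = np.linalg.solve(cv_z, cv_xz.T).T

def residual_factor_cov(b):
    """Covariance between the residual X - b Z (up to a constant) and the factors Z."""
    return cv_xz - b @ cv_z

cv_uz = residual_factor_cov(beta)              # vanishes for the regression loadings
cv_uz_other = residual_factor_cov(beta + 0.1)  # any other loadings leave correlation

# covariance of the regression residuals: in general it is not diagonal
cv_u = cv_x - beta @ cv_xz.T
```

Since the factor covariance is invertible, `cv_xz - b @ cv_z` can vanish only for `b = beta`, which is the uniqueness statement above; the non-zero off-diagonal entry of `cv_u` shows that the residuals remain correlated with each other.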

Similarly to the r-squared (12.46), the covariance expression (12.52) again supports the intuition that a “good” model must display high overall correlations between the target variable $\boldsymbol{X}$ and the factors $\boldsymbol{Z}$. Indeed, consider a simple case with one univariate target variable $X$ and one factor $Z$ which are jointly normal with correlation $\rho$. Then, the residual variance, which reads E.12.13

$$\mathbb{V}\{U\} = (1 - \rho^2)\,\mathbb{V}\{X\} \tag{12.54}$$

is minimal when the r-squared (12.48) is maximal, or $|\rho| = 1$.

Example 12.9. We continue from Example 12.8. The residual and the factors are jointly normal , where

(12.55) |

#### 12.2.5 Natural scatter specification

So far we have left unspecified the scale matrix $\boldsymbol{\sigma}^2$ that defines the r-squared (12.10) in the regression LFM optimization (12.36)-(12.37).

For any choice of $\boldsymbol{\sigma}^2$ we always obtain the same loadings (12.40) and hence the same dominant-residual decompositions (12.12)-(12.13). For this reason the natural choice for $\boldsymbol{\sigma}^2$ is simply the identity

$$\boldsymbol{\sigma}^2 \equiv \mathbb{I}_{\bar{n}} \tag{12.56}$$
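The irrelevance of the scale matrix can be illustrated with a quick numerical experiment. The sketch below assumes a hypothetical joint covariance (two targets, one factor), draws random symmetric positive-definite scale matrices, and verifies that the regression loadings always achieve a scaled expected squared residual no larger than that of randomly perturbed loadings. The names `loss` and `random_scale2` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical joint covariance of (X1, X2, Z): two targets, one factor
sigma2 = np.array([[1.0, 0.6, 0.5],
                   [0.6, 1.0, 0.3],
                   [0.5, 0.3, 1.0]])
n = 2
cv_x, cv_xz, cv_z = sigma2[:n, :n], sigma2[:n, n:], sigma2[n:, n:]
beta = np.linalg.solve(cv_z, cv_xz.T).T   # regression loadings

def loss(b, scale2):
    """Scaled expected squared residual: trace of inv(scale2) @ Cov(X - b Z)."""
    cv_u = cv_x - b @ cv_xz.T - cv_xz @ b.T + b @ cv_z @ b.T
    return np.trace(np.linalg.solve(scale2, cv_u))

def random_scale2():
    """Random symmetric positive-definite scale matrix."""
    a = rng.normal(size=(n, n))
    return a @ a.T + 0.1 * np.eye(n)

always_best = True
for _ in range(200):
    scale2 = random_scale2()
    b_other = beta + rng.normal(scale=0.2, size=beta.shape)
    always_best = always_best and bool(loss(beta, scale2) <= loss(b_other, scale2))
```

Maximizing the r-squared is equivalent to minimizing this scaled loss, so the same `beta` being optimal for every `scale2` is consistent with choosing the identity for convenience.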

#### 12.2.6 The projection operator

The linear regression (12.35) allows us to introduce a projection operator, which can be thought of as the linear counterpart of the operation of conditioning, and two ensuing summary statistics, which can be thought of as the linear counterpart of the conditional distribution (31.48).

Just as the conditional distribution represents all that can be inferred about $\boldsymbol{X}$ from i) knowledge of $\boldsymbol{Z}$ and ii) the joint distribution of $(\boldsymbol{X}, \boldsymbol{Z})$, the two summary statistics that we introduce below represent all that can be inferred about $\boldsymbol{X}$ from i) knowledge of $\boldsymbol{Z}$ and ii) the first two moments (12.7) of the joint distribution, as we summarize in the table below.

The above table is of capital importance to define the two and only two approaches currently used in statistics and finance.

To define the first summary statistic, let us interpret the linear regression prediction (12.67) as **linear projection**

(12.57) |

where in the notation of the optimal regression coefficients (12.41) and (12.40) we emphasized the dependence on the variables.

The nomenclature “linear projection” is justified because the linear prediction (12.67) can be interpreted geometrically at the same time as best prediction (27.173) or, equivalently, as orthogonal projection (27.155), as we will show later in (12.68).

From the linear projection (12.57) we can define the **(point) linear prediction**, which is the linear projection (12.57) of the target variable evaluated at a specific realization of the factors
(12.58) |

To introduce the second summary statistic, let us interpret the residual in the linear regression (12.35) as **weak innovation**

(12.59) |

where the nomenclature “weak” is due to the residual being uncorrelated with the factors (12.73), as opposed to being independent of the factors (“strong” innovation).

We can gain more insight into the residual (12.59) through the **linear loss matrix**, which is the linear projection (12.57), or regression projection, applied to the squared cross-residuals

(12.60) |

The linear loss matrix (12.60) depends on the third-order moments of the variables E.40.51, and thus we cannot compute it from knowledge of the first two moments (12.7) alone.

However, we can denote and compute the (linear) **partial covariance matrix** as the expectation of the linear loss matrix (12.60), which reads E.40.51

(12.61) |

Note how, unlike the linear loss matrix (12.60), the partial covariance matrix (12.61) depends only on the second-order moments of the variables, which are known (12.7).

Finally, we can define the (linear) **partial correlation matrix** [W] as the correlation matrix (34.48) extracted from the partial covariance matrix (12.61), or

(12.62) |

Given two target random vectors, consider their residuals, or weak innovations (12.59), with respect to a common random vector. We say that the two target vectors are **partially orthogonal** with respect to the common vector if their residuals are orthogonal or, equivalently, if the respective block of the partial covariance (12.61) is zero

(12.63) |

Note that we use on purpose the plain notation , since residuals have zero mean by construction, see (12.72), and hence their orthogonality is equivalent with respect to either or . Partial orthogonality (12.63) lies at the foundation of systematic-idiosyncratic models (Section 12.1.3), which generalize to graphical models (Section 15.3).

The partial covariance (12.61) between two entries of the $\bar{n}$-variate random vector $\boldsymbol{X}$ with respect to all the remaining entries is determined by the respective entries of the inverse covariance matrix [W]

(12.64) |

where the conditioning set collects all the entries of $\boldsymbol{X}$ except for the two under consideration. Therefore two entries are partially orthogonal (12.63) if and only if the respective off-diagonal entry of the inverse covariance matrix is zero

(12.65) |

The relationship (12.65) naturally induces a graph structure among random variables, as we shall see in (15.188).
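The link between partial orthogonality and zeros of the inverse covariance can be verified on a toy example. The sketch below assumes a hypothetical three-variable "chain" whose inverse covariance (precision) matrix has a zero in the (1,3) entry, and computes partial covariances directly as covariances of regression residuals. The helper `partial_cov` is an illustrative name.

```python
import numpy as np

# hypothetical chain X1 - X2 - X3: the (1,3) entry of the precision matrix is zero
precision = np.array([[ 2.0, -1.0,  0.0],
                      [-1.0,  2.0, -1.0],
                      [ 0.0, -1.0,  2.0]])
sigma2 = np.linalg.inv(precision)   # covariance of (X1, X2, X3)

def partial_cov(sigma2, i, j):
    """Covariance of the residuals of entries i and j regressed on all other entries."""
    keep = [i, j]
    rest = [k for k in range(len(sigma2)) if k not in keep]
    cv_x = sigma2[np.ix_(keep, keep)]
    cv_xz = sigma2[np.ix_(keep, rest)]
    cv_z = sigma2[np.ix_(rest, rest)]
    return cv_x - cv_xz @ np.linalg.solve(cv_z, cv_xz.T)

pc_13 = partial_cov(sigma2, 0, 2)   # X1, X3 given X2: off-diagonal entry is zero
pc_12 = partial_cov(sigma2, 0, 1)   # X1, X2 given X3: off-diagonal entry is non-zero

# the 2x2 partial covariance equals the inverse of the matching precision block
block_13 = np.linalg.inv(precision[np.ix_([0, 2], [0, 2])])
```

Here `X1` and `X3` are correlated marginally (`sigma2[0, 2]` is non-zero), yet partially orthogonal given `X2`; in a graph representation this corresponds to a missing edge between the first and third variable.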

##### Geometrical interpretation

The linear regression solution (12.36) has two geometrical interpretations: as the point in the linear span of the factors closest to the target, and as an orthogonal projection of the target onto the span of the factors. In the Euclidean geometry induced by the expectation inner product, the two concepts of least distance and orthogonal projection are the same, see Figure 12.6. Refer to Section 14.1.10 for non-linear generalizations.

For a given
vector ,
and a vector
with finite second
moments (35.139), let us define the **linearized information set** as the space of all affine transformations of the
random variable