The first of the three main classes of dominant-residual LFM’s (12.12) is the class of regression LFM’s. In regression LFM’s, the factors are given exogenously, and the loadings are constructed.
The purpose of regression models is explanatory: given a large or small (even one) number $\bar{n}$ of target variables $X$, the goal is to explain as much as possible of the randomness in $X$ in terms of a few ($\bar{k} \ll \bar{n}$) observable factors $Z$.
Regression LFM’s are also known as “macroeconomic” LFM’s, because in some applications the factors are macroeconomic variables, such as interest rates, stock market returns, etc.
A regression LFM, or macroeconomic linear factor model, for an $\bar{n}$-dimensional target variable $X$ is a dominant-residual decomposition (12.12)
\[
X = \alpha + \beta Z + U. \tag{12.35}
\]
In (12.35) the target variables $X$ and factors $Z$ are observable, in that we can assume known the equivalence class of their joint distribution (12.7), and thus their joint expectations and covariances. Then, the loadings matrix $\beta$ is constructed in such a way as to maximize the r-squared of the given factors.
More precisely, let us start with:
i) a symmetric and positive-definite matrix $\sigma^{2}$ that defines the r-squared objective (12.10);
ii) a number $\bar{k}$ of observable factors $Z$.
Then a regression LFM is a dominant-residual LFM (12.13)
\[
(\alpha, \beta) \equiv \operatorname*{argmax}_{(a, b) \in \mathcal{C}} \mathcal{R}^{2}_{\sigma^{2}}\{X, a + bZ\}, \tag{12.36}
\]
where $\sigma$ is the Riccati root of $\sigma^{2}$ (27.388), and the constraints are:
i) the factors $Z$ are given exogenously, so that the dominant component reads $a + bZ$ for a suitable vector $a$ and matrix $b$;
ii) the residuals have zero expectation $E\{U\} = 0$, which with i) implies $a = E\{X\} - b\,E\{Z\}$.
Therefore, the constraints become
\[
\mathcal{C} \equiv \{(a, b) : a = E\{X\} - b\,E\{Z\}\}.
\]
Then the matrix $b$ is optimized in (12.36) to yield the optimal loadings $\beta$ (12.40), which maximize the r-squared.
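To make the optimal parameters concrete, below is a minimal numpy sketch that constructs the shift (12.41) and loadings (12.40) from the first two joint moments of target and factors; the function name regression_lfm and the argument names are ours, not the book’s.

```python
import numpy as np

def regression_lfm(e_x, e_z, cv_xz, cv_z):
    """Optimal regression loadings (12.40) and shift (12.41).

    e_x   : (n,)   expectation E{X} of the target
    e_z   : (k,)   expectation E{Z} of the factors
    cv_xz : (n, k) covariance Cv{X, Z} of target and factors
    cv_z  : (k, k) covariance Cv{Z} of the factors
    """
    beta = cv_xz @ np.linalg.inv(cv_z)   # beta = Cv{X,Z} (Cv{Z})^{-1}
    alpha = e_x - beta @ e_z             # alpha = E{X} - beta E{Z}
    return alpha, beta
```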
Regression LFM’s are arguably the most implemented models for supervised learning because of their intuitive features (Section 12.6.13). Furthermore, regression LFM’s can be mapped from the affine format (12.35) to the equivalent linear format $X = \beta \tilde{Z} + U$, by considering the constant $1$ as an additional factor, see Section 14.1.4.
Consider a univariate target variable $X$ and one observable factor $Z$. Suppose the variables are jointly normal, $(X, Z)' \sim \mathrm{N}(\mu, \sigma^{2})$. The regression line
\[
x = \alpha + \beta z
\]
represents, among all possible lines $x = a + bz$ such that $a, b \in \mathbb{R}$, the one that best fits the joint distribution of $(X, Z)$, represented by a large set of simulations $\{(x_{j}, z_{j})\}$. Then, we observe how the area corresponding to the squared errors $(x_{j} - \alpha - \beta z_{j})^{2}$ (red) is smaller than the area corresponding to $(x_{j} - a - b z_{j})^{2}$ (blue) for two arbitrary simulations $(x_{j}, z_{j})$. This is not surprising, since the regression parameters $(\alpha, \beta)$ are those that materialize on average the least squared errors (12.36).
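A short simulation in the spirit of this example, with illustrative parameters of our own choosing (the example’s original $\mu$ and $\sigma^{2}$ are not reproduced here): the regression line attains a smaller average squared error than any competing line.

```python
import numpy as np

rng = np.random.default_rng(0)

# jointly normal (X, Z); the parameters below are illustrative only
mu = np.array([0.0, 0.0])
sig2 = np.array([[1.0, 0.6],
                 [0.6, 1.0]])
x, z = rng.multivariate_normal(mu, sig2, size=100_000).T

# regression parameters (12.40)-(12.41)
beta = np.cov(x, z, bias=True)[0, 1] / z.var()
alpha = x.mean() - beta * z.mean()

# the regression line minimizes the average squared errors (12.36)
a, b = 0.2, 0.1                                  # an arbitrary competing line
print(np.mean((x - alpha - beta * z) ** 2))      # smaller
print(np.mean((x - a - b * z) ** 2))             # larger
```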
The regression parameters $(\alpha, \beta)$ define the best-fit line in the same way as the expectation and covariance define the best-fit ellipsoid in Example 35.23.
Note that the optimal loadings $\beta$ (12.40) do not depend on the scale matrix $\sigma^{2}$ specifying the r-squared (12.10).
Note that all that matters to compute the optimal loadings (12.40) and the optimal shift (12.41) are the first two moments of the joint distribution of $(X, Z)$. Hence, any result holds on an equivalence class of distributions identified by the first two moments (12.7).
The regression prediction (12.3) becomes
\[
\hat{X} \equiv \alpha + \beta Z = E\{X\} + Cv\{X, Z\}\,(Cv\{Z\})^{-1}(Z - E\{Z\}). \tag{12.44}
\]
This equation shows that the regression prediction $\hat{X}$ lives in a $\bar{k}$-dimensional subspace, embedded in the $(\bar{n} + \bar{k})$-dimensional space of the original target variable $X$ and factors $Z$, see Figure 12.5. Substituting the optimal parameters (12.40)-(12.41) in the objective (12.10), we obtain the maximal r-squared
\[
\mathcal{R}^{2}_{\sigma^{2}}\{X, \hat{X}\} = \frac{\operatorname{tr}(\sigma^{-2}\, Cv\{X, Z\}\,(Cv\{Z\})^{-1}\, Cv\{Z, X\})}{\operatorname{tr}(\sigma^{-2}\, Cv\{X\})}, \tag{12.46}
\]
where $Cv\{X, Z\}$ is the covariance matrix of target variables and factors, and $Cv\{Z\}$ is the covariance matrix of the factors.
Since a model that fits the target variable well must have a high r-squared, i.e. $\mathcal{R}^{2} \approx 1$, the factors should be as uncorrelated with each other as possible. If there are high correlations among the factors, or high collinearity, the matrix $Cv\{Z\}$ would be ill-conditioned [W] and possibly become singular, so that the inverse in (12.44) is not defined. Geometrically, this means that the space of the prediction (12.44) is not properly identified, see Figure 12.5.
Example 12.8. Consider a univariate target variable $X$ and two observable factors $Z \equiv (Z_{1}, Z_{2})'$. Suppose the variables are jointly normal, $(X, Z')' \sim \mathrm{N}(\mu, \sigma^{2})$.
In Figure 12.5 we compare the target variable $X$ and the corresponding regression prediction $\hat{X}$ (12.44), which is the projection of $X$ onto the regression plane. More precisely:
-) In the left plot we show the simulations of $(X, Z_{1}, Z_{2})$, generated by varying the entries of the correlation matrix, along with the regression plane.
-) In the right plot we show the corresponding projected simulations of the prediction $\hat{X}$.
Notice that when the factors are collinear, i.e. when their correlation approaches $\pm 1$, the predicted simulations (green dots) shrink along a line, which is a one-dimensional subspace of the regression plane (see the right plot). This means that any two-dimensional regression plane passing through that line is admissible (see the left plot), and so the space of the prediction is not properly identified.
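The numerical counterpart of this geometric degeneracy is the conditioning of $Cv\{Z\}$. A quick check (our own illustration, not from the book) shows how the condition number explodes as the factor correlation approaches one, at which point the inverse in (12.40) and (12.44) is no longer reliable.

```python
import numpy as np

# condition number of the factor covariance Cv{Z} as collinearity increases
for rho in (0.0, 0.9, 0.999, 0.999999):
    cv_z = np.array([[1.0, rho],
                     [rho, 1.0]])
    # cond = (1 + rho) / (1 - rho): 1, 19, ~2e3, ~2e6 -> near-singular
    print(rho, np.linalg.cond(cv_z))
```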
Furthermore, the r-squared expression (12.46) supports the intuition that a “good” model must display high overall correlations between the target variable $X$ and the factors $Z$. Indeed, consider a simple case with one univariate target variable $X$ and one factor $Z$ which are jointly normal with correlation $\rho$. If we set $\bar{n} = \bar{k} = 1$ in the r-squared (12.46), the scale $\sigma^{2}$ cancels and we obtain that the r-squared is the squared correlation between the factor and the target variable
\[
\mathcal{R}^{2} = \rho^{2}. \tag{12.48}
\]
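A quick simulation check of (12.48), with an illustrative correlation of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

rho = 0.7
sig2 = np.array([[1.0, rho],
                 [rho, 1.0]])
x, z = rng.multivariate_normal(np.zeros(2), sig2, size=1_000_000).T

beta = np.cov(x, z, bias=True)[0, 1] / z.var()
alpha = x.mean() - beta * z.mean()
u = x - alpha - beta * z                  # residuals

r2 = 1.0 - u.var() / x.var()              # fraction of variance explained
print(r2, rho**2)                         # both approximately 0.49
```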
From the regression prediction (12.44) we can compute the residuals $U \equiv X - \hat{X}$, or
\[
U = X - \alpha - \beta Z.
\]
The residuals are uncorrelated with the factors (E.12.3)
\[
Cv\{U, Z\} = 0,
\]
and thus the factors are systematic (12.16). As a matter of fact, the optimal matrix of regression loadings $\beta$ (12.40) is, among all the matrices $b$, the only one which makes the residuals uncorrelated with the factors. See Section 12.6.5 for more on this profound result.
Finally, the explicit expression of the covariance of the residuals reads (E.12.4)
\[
Cv\{U\} = Cv\{X\} - Cv\{X, Z\}\,(Cv\{Z\})^{-1}\,Cv\{Z, X\}. \tag{12.47}
\]
Hence, the residuals are in general not uncorrelated with each other: the covariance (12.47) need not be diagonal.
Similar to the r-squared (12.46), the covariance expression (12.47) supports again the intuition that a “good” model must display high overall correlations between the target variable $X$ and the factors $Z$. Indeed, consider a simple case with one univariate target variable $X$ and one factor $Z$ which are jointly normal with correlation $\rho$. Then, the residual variance, which reads (E.12.13)
\[
V\{U\} = (1 - \rho^{2})\,V\{X\},
\]
is minimal when the r-squared (12.48) is maximal, or $\rho^{2} = 1$.
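The two identities above, $Cv\{U, Z\} = 0$ (E.12.3) and the Schur-complement expression (12.47) (E.12.4), can be verified exactly at the level of moments, with no simulations; the joint covariance below is randomly generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 3, 2                                   # dimensions of X and Z
a = rng.standard_normal((n + k, n + k))
sig2_joint = a @ a.T                          # random covariance of (X, Z)

cv_x = sig2_joint[:n, :n]                     # Cv{X}
cv_xz = sig2_joint[:n, n:]                    # Cv{X, Z}
cv_z = sig2_joint[n:, n:]                     # Cv{Z}

beta = cv_xz @ np.linalg.inv(cv_z)            # optimal loadings (12.40)

cv_uz = cv_xz - beta @ cv_z                   # Cv{U,Z} = Cv{X,Z} - beta Cv{Z}
print(np.allclose(cv_uz, 0.0))                # True: E.12.3

cv_u = cv_x - cv_xz @ np.linalg.inv(cv_z) @ cv_xz.T   # Schur complement (12.47)
print(np.allclose(np.diag(np.diag(cv_u)), cv_u))      # generally False:
                                                      # residuals correlated
```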
Example 12.9. We continue from Example 12.8. The residual $U$ and the factors $Z$ are jointly normal, where the expectation follows from the zero-mean residual constraint and the covariance is block-diagonal, since the residual is uncorrelated with the factors (E.12.3).
The linear regression (12.35) allows us to introduce a projection operator, which can be thought of as the linear counterpart of the operation of conditioning; and two ensuing summary statistics, which can be thought of as the linear counterpart of the conditional distribution (31.48).
Just as the conditional distribution represents all that can be inferred about $X$ from i) knowledge of $Z$ and ii) the joint distribution of $(X, Z)$, the two summary statistics that we introduce below represent all that can be inferred about $X$ from i) knowledge of $Z$ and ii) the first two moments (12.7) of the joint distribution of $(X, Z)$, as we summarize in the table below.

knowledge of $Z$ + joint distribution of $(X, Z)$ → conditional distribution
knowledge of $Z$ + first two moments (12.7) of $(X, Z)$ → linear projection and weak innovation
The above table is of capital importance to define the two and only two approaches currently used in statistics and finance.
To define the first summary statistic, let us interpret the linear regression prediction (12.65) as a linear projection of the target $X$ onto the factors $Z$ (12.57).
The nomenclature “linear projection” is justified because the linear prediction (12.65) can be interpreted geometrically at the same time as the best prediction (27.173) or, equivalently, as the orthogonal projection (27.155), as we will show later in (12.66).
To introduce the second summary statistic, let us interpret the residual $U$ in the linear regression (12.35) as a weak innovation (12.59), where the nomenclature “weak” is due to $U$ being uncorrelated with $Z$ (12.71), as opposed to being independent of $Z$ (“strong” innovation).
Given two target random vectors $X_{1}$ and $X_{2}$, let us denote the residuals, or weak innovations (12.59), with respect to a common random vector $Z$ by $U_{1}$ and $U_{2}$ respectively. We say that $X_{1}, X_{2}$ are partially orthogonal with respect to $Z$ if their residuals are orthogonal, or equivalently, if the respective block of the partial covariance (12.61) is zero
\[
E\{U_{1} U_{2}'\} = 0. \tag{12.63}
\]
Note that we use on purpose the plain notation $E\{U_{1} U_{2}'\}$, since residuals have zero mean by construction, see (12.70), and hence their orthogonality is equivalent whether stated in terms of the expectation $E\{U_{1} U_{2}'\}$ or the covariance $Cv\{U_{1}, U_{2}\}$. Partial orthogonality (12.63) lies at the foundation of systematic-idiosyncratic models (Section 12.1.3), which generalize to graphical models (Section 15.3).
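A minimal sketch of the partial-covariance check; the function name partial_covariance is ours, and the formula is the same Schur complement as in (12.47), restricted to the $(X_{1}, X_{2})$ block.

```python
import numpy as np

def partial_covariance(cv_x1x2, cv_x1z, cv_x2z, cv_z):
    """Covariance of the weak innovations U_1, U_2 of X_1, X_2 w.r.t. Z.

    X_1 and X_2 are partially orthogonal with respect to Z (12.63)
    exactly when the returned block is zero.
    """
    cv_z_inv = np.linalg.inv(cv_z)
    # Cv{U_1, U_2} = Cv{X_1, X_2} - Cv{X_1, Z} (Cv{Z})^{-1} Cv{Z, X_2}
    return cv_x1x2 - cv_x1z @ cv_z_inv @ cv_x2z.T
```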
The linear regression solution (12.36) has two geometrical interpretations: as the point in the linear span of the factors closest to the target, and as an orthogonal projection of the target onto the span of the factors. In the Euclidean geometry induced by the expectation inner product, the two concepts of least distance and orthogonal projection are the same, see Figure 12.6. Refer to Section 14.1.10 for non-linear generalizations.
For a given dimension $\bar{n}$, and a random vector $Z$ with finite second moments (35.139), let us define the linearized information set as the space of all affine transformations of the random vector $Z$
\[
\{a + bZ\}, \tag{12.64}
\]
where $a$ are arbitrary $\bar{n}$-dimensional vectors and $b$ are arbitrary $\bar{n} \times \bar{k}$ matrices. This space is a linear subspace of the space of $\bar{n}$-dimensional random vectors with finite first two moments (35.139).
The definition of linear regression (12.36) shows that the linear projection (12.57) is the best prediction (27.173) of $X$ provided by elements of the linearized information set (12.64), with respect to the (squared) expectation distance (35.149)
\[
(\alpha, \beta) = \operatorname*{argmin}_{(a, b)} E\{\|X - a - bZ\|^{2}\}.
\]
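The orthogonal-projection reading can be checked by simulation: the residual of the linear projection is orthogonal to any element $a + b'Z$ of the linearized information set (here a hand-picked one, purely for illustration), which is exactly what makes the projection the least-distance prediction.

```python
import numpy as np

rng = np.random.default_rng(3)

# jointly normal (X, Z1, Z2) with an illustrative random covariance
a_rand = rng.standard_normal((3, 3))
xz = rng.multivariate_normal(np.zeros(3), a_rand @ a_rand.T, size=100_000)
x, z = xz[:, 0], xz[:, 1:]

cv = np.cov(xz.T, bias=True)
beta = cv[0, 1:] @ np.linalg.inv(cv[1:, 1:])     # loadings (12.40)
alpha = x.mean() - beta @ z.mean(axis=0)         # shift (12.41)
u = x - alpha - z @ beta                         # weak innovation (12.59)

elem = 0.3 + z @ np.array([1.0, -2.0])           # arbitrary element a + b'Z
print(np.mean(u * elem))                         # ~0: residual orthogonal
```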