Consider the following structural equations:
store sales := f(store visits, x_1, x_2)
store visits := g(x_1, x_2)
where x_1 and x_2 denote promotional activity spending (f and g are distinct functions).
Note that x_1 and x_2 are not the sole drivers of store visits; there is also a baseline level of store visits.
In this case, assume that we model store visits and store sales as:
log(store sales) := dayofweek_baseline + dayofmonth_baseline + b_3 * log(x_1) + b_4 * log(x_2) + b_5 * \hat{log(store visits)}
where the hat denotes that the store visits are predicted values from the equation below:
log(store visits) := dayofweek_baseline + dayofmonth_baseline + b_1 * log(x_1) + b_2 * log(x_2)
Now note that the baselines of store visits and store sales stem from the same underlying demand, which is unobserved, although we try to model it with e.g. an unobserved-components (UCM) approach or just a basic regression with dummies.
In the store visits equation we aim to capture this demand through the baselines. But notice that when we plug the predicted values into the store sales equation, the estimate of b_5 will be severely biased, since predicted store visits now capture this underlying demand (which should have been attributed to the baselines in the store sales equation).
I've been thinking about using copulas as latent instrumental variables to attack this issue. I wonder, however, whether normality tests on my predicted store visit values would be sufficient. It bothers me that we assume a normally distributed error for the store visits equation, but that does not imply that the actual predicted values are normally distributed. Or am I missing something?
Any tips on how to proceed?
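To make the bias concrete, here is a minimal simulation sketch of the setup above. All coefficient values, noise scales, and the demand process are illustrative assumptions, not estimates from real data; for simplicity the visit series enters the sales equation directly, but the same mechanism inflates b_5 whenever the visit regressor carries the common unobserved demand component:

```python
# Toy simulation of the common-demand bias described above.
# All coefficients and distributions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

demand = rng.normal(size=n)   # unobserved demand hitting BOTH equations
log_x1 = rng.normal(size=n)   # log promotional spend 1
log_x2 = rng.normal(size=n)   # log promotional spend 2

# Visits and sales both load on the same unobserved demand.
log_visits = 1.0 + 0.3 * log_x1 + 0.2 * log_x2 + demand + 0.1 * rng.normal(size=n)
log_sales = (2.0 + 0.1 * log_x1 + 0.1 * log_x2
             + 0.5 * log_visits + demand + 0.1 * rng.normal(size=n))

def ols(X, y):
    """OLS coefficients with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Sales equation with the demand-laden visit series as a regressor:
beta = ols(np.column_stack([log_x1, log_x2, log_visits]), log_sales)
b5_hat = beta[3]
print(f"true b_5 = 0.5, OLS estimate = {b5_hat:.2f}")  # badly inflated
```

The estimate of b_5 absorbs the demand component (here it roughly triples), which is exactly the attribution problem described above.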
Endogeneity issues in recursive SEM (use copulas?)

 PLS Junior User
 Posts: 1
 Joined: Thu Sep 21, 2023 6:05 am
 Real name and title: jack anderson
Re: Endogeneity issues in recursive SEM (use copulas?)
jack anderson wrote: ↑Mon Sep 25, 2023 10:33 am (post quoted in full above)
This method has some advantages over traditional instrumental variable methods, such as allowing for discrete endogenous regressors, addressing the slope endogeneity problem, and relaxing the exogeneity assumptions required of external instruments.
However, this method also has some limitations and challenges. One of them is the choice of the copula function, which can affect the estimation results and the inference. There are different types of copula functions, such as Gaussian, Student’s t, Clayton, Gumbel, etc., that can capture different degrees and patterns of dependence. Choosing an appropriate copula function requires some prior knowledge of the data and the problem, as well as some model selection criteria, such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
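As a concrete sketch of the copula idea, here is the Gaussian-copula control-term construction (in the spirit of Park and Gupta's approach; the function name and the rank-based empirical CDF are my assumptions, not any particular package's API). The endogenous regressor is mapped through its empirical CDF and then through the inverse normal CDF, and the resulting term is appended to the sales regression as a control:

```python
# Sketch of a Gaussian-copula control term for an endogenous regressor.
# The rank-based ECDF is scaled into the open interval (0, 1) to avoid
# +/- infinity from norm.ppf at the sample extremes.
import numpy as np
from scipy.stats import norm

def gaussian_copula_term(v):
    """Map v through its empirical CDF, then the inverse normal CDF."""
    n = len(v)
    ranks = np.argsort(np.argsort(v)) + 1   # ranks 1..n
    ecdf = ranks / (n + 1)                  # strictly inside (0, 1)
    return norm.ppf(ecdf)

# Usage: append the term to the sales-equation design matrix, e.g.
# X = np.column_stack([log_x1, log_x2, log_visits,
#                      gaussian_copula_term(log_visits)])
rng = np.random.default_rng(1)
v = rng.lognormal(size=1000)                # a non-normal regressor
p = gaussian_copula_term(v)
```

Note that identification of this control term hinges on the endogenous regressor being non-normally distributed: if log(store visits) is close to normal, the copula term is nearly collinear with it and the correction is weak. That is why the normality question in the original post matters for the regressor itself, not just for the residuals.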
Another challenge is the normality test for the predicted values of the endogenous regressor. As you mentioned, the normality test is usually performed on the residuals, not on the predicted values. The residuals are the difference between the observed values and the predicted values, and they measure the error or the deviation of the model from the data. The normality test is based on the assumption that the residuals are normally distributed with a mean of zero and a constant variance. This assumption is important for the validity of the hypothesis tests and the confidence intervals of the model parameters.
The predicted values, on the other hand, are the values that the model expects to see for a given set of explanatory variables. They are not necessarily normally distributed, and they depend on the functional form of the model and the distribution of the explanatory variables. For example, if the model is linear and the explanatory variables are jointly normally distributed, then the predicted values will also be normally distributed. But if the model is nonlinear or the explanatory variables are skewed, then the predicted values may not be normally distributed.
Therefore, performing a normality test on the predicted values may not be sufficient or appropriate for assessing the model fit or the validity of the inference. A better approach would be to perform a normality test on the residuals of the model that predicts the endogenous regressor, and then use the copula method to estimate the structural equation of interest. Alternatively, you can use a graphical tool, such as a normal probability plot or a quantile-quantile (Q-Q) plot, to visually check the normality of the predicted values or the residuals.
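A minimal sketch of that residual check (variable names and the simulated first stage are assumed for illustration; any estimator that yields first-stage residuals would do): run a Shapiro-Wilk test on the residuals rather than the fitted values, and optionally eyeball a Q-Q plot.

```python
# Check normality of first-stage residuals (not fitted values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
log_x1 = rng.normal(size=n)
log_x2 = rng.normal(size=n)
log_visits = 1.0 + 0.3 * log_x1 + 0.2 * log_x2 + 0.1 * rng.normal(size=n)

# First-stage OLS fit and its residuals.
X = np.column_stack([np.ones(n), log_x1, log_x2])
beta, *_ = np.linalg.lstsq(X, log_visits, rcond=None)
resid = log_visits - X @ beta

stat, pvalue = stats.shapiro(resid)   # H0: residuals are normal
print(f"Shapiro-Wilk W = {stat:.3f}, p = {pvalue:.3f}")

# Q-Q plot for a visual check (uncomment if matplotlib is available):
# import matplotlib.pyplot as plt
# stats.probplot(resid, dist="norm", plot=plt); plt.show()
```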