Multi-group comparisons with PLS

Questions about the implementation and application of the PLS-SEM method that are not related to the use of the SmartPLS software.
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Multi-group comparisons with PLS

Post by stefanbehrens »

Dear PLS-Community,

I would like to compare two subsamples of my data for differences in the structural path estimates. In particular, I would like to test whether the observed differences in the estimates are actually significant.

Dibbern/Chin (2005) describe a procedure using random permutations. However, their description of the algorithm is rather sketchy with regard to how the data are actually permuted. In particular, I would like to understand how the individual random permutations are created when the two subsamples differ in size.

Could anyone explain how this is done?

Also, I'm wondering if anybody has already done this using SmartPLS. Creating the 100+ permutations manually and then estimating 200+ models sounds tedious. Is there an easier way?

Thanks in advance.
Cheers,
Stefan
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

All,

Since nobody has answered my post, I had to find a way to solve this myself ;-). I'm posting the procedure here for anyone who is interested (feedback highly welcome):

A) run the model with the full data set in PLS and save the LV-scores
B) split the file with the LV-scores into two subgroups you would like to compare
C) run a PLS model for each subgroup using the LV-scores as single indicators for each LV and note down the path coefficients
D) import the full sample of LV-scores (from step A) into SPSS (or any other statistical software)
E) run a macro that
- randomly filters a number of cases from your sample without replacement (the same number as there are cases in your first subgroup)
- runs a multiple linear regression analysis and stores the regression coefficients in a file
- inverts the filter (thus selecting the remaining cases from your sample)
- runs a multiple linear regression analysis and stores the regression coefficients in a file (preferably a different file)
- repeats for as many permutations as you want (e.g. 1000)
F) Load both files with the regression coefficients into Excel (2 sheets)
G) Create a third sheet that contains the differences of each coefficient from corresponding lines in the sheets with the regression coefficients (this is now essentially a bootstrap distribution of your difference statistic)
H) Calculate the coefficient differences for your two subsamples (step C)
I) Check how many permutations have resulted in a larger/ smaller difference than those calculated in step H
J) Based on the total number of permutations (e.g. 1000), calculate the percentage of permutations with differences larger/smaller than the observed differences in your subsamples. Percentages below 5% and above 95% indicate differences that are significant at alpha = 0.05.
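For anyone who wants to script this outside SPSS, the core of steps E, I and J can be sketched in Python. This is a rough sketch with illustrative function and variable names (none of them come from the original procedure); it also shows how unequal subsample sizes are handled: each permutation simply draws n1 cases at random for group 1 and assigns the remaining n - n1 cases to group 2.

```python
import numpy as np

def ols_slopes(X, y):
    # OLS slope coefficients (intercept dropped) via least squares
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

def permutation_diffs(X, y, n1, n_perm=1000, seed=0):
    """Step E: repeatedly split the full sample into random subgroups of
    sizes n1 and n - n1 (without replacement) and store the difference in
    regression coefficients for each split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty((n_perm, X.shape[1]))
    for i in range(n_perm):
        idx = rng.permutation(n)
        g1, g2 = idx[:n1], idx[n1:]
        diffs[i] = ols_slopes(X[g1], y[g1]) - ols_slopes(X[g2], y[g2])
    return diffs

def p_value(observed_diff, diffs, j):
    """Steps I/J: fraction of permuted differences for coefficient j
    that are smaller than the observed difference."""
    return np.mean(diffs[:, j] < observed_diff)
```

With the LV scores as `X` and `y`, an observed difference falling below the 2.5th or above the 97.5th percentile of `diffs` would be significant at alpha = 0.05 (two-sided).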

Comments on this procedure are highly welcome.

Cheers,
Stefan

PS: For those who are interested, I'm posting the SPSS Syntax for the permutation procedure below...



* ============================================================
* = PERMUTATION TESTING OF REGRESSION COEFFICIENT DIFFERENCES
* = SPSS 13.0 SYNTAX 
* =.



* ====  Change Program Settings  ====.

SET MITERATE=10000.
PRESERVE.
SET TVARS NAMES.
OMS /DESTINATION VIEWER=NO /TAG='suppressall'.

* Select regression coefficients tables and write to data file.
OMS /SELECT TABLES
    /IF COMMANDS=['Regression'] SUBTYPES=['Coefficients']
    /DESTINATION FORMAT=SAV OUTFILE='--NAME AND PATH OF OUTPUT FILE--.sav'
    /COLUMNS DIMNAMES=[ 'Variables'  'Statistics']
    /TAG='reg_coeff'.


* ====  Define Macro for Permutations  ====.

DEFINE regression_permutations (permutations=!TOKENS(1)
	/subsample=!TOKENS(1)
	/depvar=!TOKENS(1)
	/indvars=!CMDEND)

COMPUTE dummyvar=1.
AGGREGATE
  /OUTFILE = * MODE = ADDVARIABLES
  /BREAK=dummyvar
  /filesize=N.
!DO !other=1 !TO !permutations
SET SEED RANDOM.
WEIGHT OFF.
FILTER OFF.
DO IF $casenum=1.
- COMPUTE #subsamplesize=!subsample.
- COMPUTE #filesize=filesize.
END IF.
DO IF (#subsamplesize>0 and #filesize>0).
- COMPUTE filter_$ = uniform(1)* #filesize < #subsamplesize.
- COMPUTE #subsamplesize = #subsamplesize - filter_$.
- COMPUTE #filesize = #filesize - 1.
ELSE.
- COMPUTE filter_$ = 0.
END IF.
EXECUTE.
FILTER BY filter_$.
REGRESSION
  /STATISTICS COEFF
  /DEPENDENT !depvar
  /METHOD=ENTER !indvars.
RECODE filter_$  (1=0)  (0=1).
EXECUTE .
REGRESSION
  /STATISTICS COEFF
  /DEPENDENT !depvar
  /METHOD=ENTER !indvars.
!DOEND

!ENDDEFINE.


* ====  Get Input File for Regression  ====.

GET FILE='--NAME AND PATH OF INPUT FILE--.sav'.


* ====  Call Permutation Macro  ====.

regression_permutations
   permutations=1000
   subsample=50
   depvar= dependentvariable
   indvars= independentvariable1 independentvariable2 independentvariable3 .


* ====  Normalize Settings and Open Results  ====.

OMSEND.
RESTORE.
GET FILE '--NAME AND PATH OF OUTPUT FILE--.sav'.

Last edited by stefanbehrens on Tue Mar 07, 2006 9:10 am, edited 1 time in total.
joerghenseler
PLS Expert User
Posts: 39
Joined: Fri Oct 14, 2005 9:59 am
Real name and title:

Group comparisons

Post by joerghenseler »

Hi Stefan,

In principle, your workaround is okay.
However, I would suggest another approach:
I recommend having a look at the literature on modeling interaction effects in linear regressions, mainly the small green booklet by Jaccard and Turrisi.
They basically say that you can treat the grouping variable (which must be in dummy form) like a moderating variable.

The bootstrap results for the interaction term give you an answer to the question of whether there are significant group differences.

Best regards,

Jörg
cringle
SmartPLS Developer
Posts: 818
Joined: Tue Sep 20, 2005 9:13 am
Real name and title: Prof. Dr. Christian M. Ringle
Location: Hamburg (Germany)
Contact:

Post by cringle »

Great effort, Stefan, this looks good!

Well, you are right about the paper you mentioned as well… However, for pragmatic reasons Chin suggests running bootstrap re-sampling for the various groups and treating the standard error estimates from each resampling in a parametric sense via t-tests – see http://disc-nt.cba.uh.edu/chin/plsfaq.htm. The weaknesses of such an approach are also discussed there.

Stefan, would it be useful to check whether the two groups are different compared to the estimates for the overall sample? (That's at least something I always look at when analyzing certain groups.) In this case, you could just run several hundred bootstrap samples (with n = number of cases in the group of interest) and then analyze whether the group-specific estimates are significantly different from the randomly created groups from the overall sample (the path estimates for each bootstrap run can easily be extracted from the SmartPLS report and then be checked as you stated before).

Best
Christian
Last edited by cringle on Tue Mar 07, 2006 9:22 am, edited 1 time in total.
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

Jörg/ Christian,

Thank you both for your helpful comments. Chin's suggested approach of using bootstrap parameters in a parametric sense is actually what I tried first. However, his formula gave me very high t-values even for the smallest coefficient differences (e.g., t = 18.3 for a path difference of 0.04, which simply can't be right). Hence, I'm wondering if I used the formula correctly. In particular, I'm unsure about the calculation of the standard errors. Since SmartPLS does not provide this statistic in the bootstrap report, I calculated the SE for a bootstrap with N = 1000 runs as follows: SE = StDev/SQRT(N). Is this correct?

Since all the other parts of his formula are very straightforward, I don't see where else I could possibly have gone wrong. Playing around with the N in the above calculation, I found that if I used the subgroup size (42) instead of the number of bootstrap runs (1000), the resulting t-values were much more realistic. However, I can't see the theoretical rationale for using the 42 instead of the 1000. Could you shed some light on this?

Many thanks in advance!

All the best,
Stefan
cringle
SmartPLS Developer
Posts: 818
Joined: Tue Sep 20, 2005 9:13 am
Real name and title: Prof. Dr. Christian M. Ringle
Location: Hamburg (Germany)
Contact:

Post by cringle »

Hi Stefan,

the bootstrap SE is equal to the std. dev.

See also Chin's note:
"The formula may differ from standard texts contrasting regression coefficients. The reason for the difference is the use of the bootstrapped standard error. This standard error is already mean adjusted reflecting the standard deviation of the sampling distribution as opposed to the sample or population standard deviation. The formula in many books assume the latter and goes on to adjust by the sample size. So, we need to correct for it by multiplying the SE from the bootstrap by the square root of the sample as well. In other words, (n-1)*square (SE from bootstrap) represents the variance of the sample."

Best
Christian
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

Hi Christian,

Thank you for pointing this out (SE = bootstrap StDev). This certainly explains why the formula didn't work earlier. However, I'm still not sure how to apply the formula properly.

In particular, I have found two different versions of the formula:

Version 1 (from Chin's homepage):
S_Pooled = SQRT{[(n1-1)^2/(n1+n2-2)]*SE1^2 + [(n2-1)^2/(n1+n2-2)]*SE2^2}
t = (b1-b2)/[S_Pooled*SQRT(1/n1+1/n2)]

Version 2 (in Keil et al. 2000, MISQ 24(2), p.315):
S_Pooled = SQRT{[(n1-1)/(n1+n2-2)]*SE1^2 + [(n2-1)/(n1+n2-2)]*SE2^2}
t = (b1-b2)/[S_Pooled*SQRT(1/n1+1/n2)]

where:
- n1, n2 = number of cases in subgroup 1 and 2
- b1, b2 = path coefficients in model 1 and 2
- SE1, SE2 = standard deviation of b1 and b2 (using bootstrap)

Notice the missing "squaring" of the (n1-1) and (n2-1) terms in the calculation of S_Pooled for version 2.

Obviously, the resulting t-values are vastly different between version 1 and version 2 (by a factor of ~6 in my case).
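To make the discrepancy concrete, here is a small Python sketch with made-up inputs (the b, SE, and n values below are purely illustrative, not from my data); the only difference between the two versions is whether the (n-1) weights in S_Pooled are squared:

```python
from math import sqrt

def t_pooled(b1, b2, se1, se2, n1, n2, squared):
    """Pooled-SE t-statistic for a path difference.
    squared=True  -> Version 1 (Chin's website), weights (n-1)^2
    squared=False -> Version 2 (Keil et al.),    weights (n-1)"""
    p = 2 if squared else 1
    s_pooled = sqrt((n1 - 1) ** p / (n1 + n2 - 2) * se1 ** 2
                    + (n2 - 1) ** p / (n1 + n2 - 2) * se2 ** 2)
    return (b1 - b2) / (s_pooled * sqrt(1 / n1 + 1 / n2))

# Illustrative (made-up) inputs: b1=0.45, b2=0.20, SE1=0.10, SE2=0.12
t1 = t_pooled(0.45, 0.20, 0.10, 0.12, 42, 58, squared=True)   # Version 1
t2 = t_pooled(0.45, 0.20, 0.10, 0.12, 42, 58, squared=False)  # Version 2
# t2 comes out several times larger than t1, consistent with the
# factor of ~6 mentioned above
```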

I tend to accept Version 1 as correct; would you agree? If yes, do you have a reference for the correct formula besides Chin's website?

Many thanks in advance,
Stefan

PS: I have compared the results of Version 1 with the results of my permutation test procedure to cross-validate. In fact, the t-values calculated with Version 1 of the formula are very close to the t-values calculated using the permutation data:

t_perm = (b1_obs-b2_obs)/stdev(b1_perm-b2_perm)

where:
- b1_obs, b2_obs are the observed coefficients for subgroups 1 and 2
- b1_perm, b2_perm are the coefficients calculated in the 1000 permutation runs
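In code form, this cross-validation statistic is simply the observed difference scaled by the spread of the permutation distribution (a hypothetical sketch; `perm_diffs` would hold the b1_perm - b2_perm values from the permutation runs):

```python
import numpy as np

def t_perm(b1_obs, b2_obs, perm_diffs):
    """t-like statistic: observed coefficient difference divided by the
    standard deviation of the permuted coefficient differences."""
    return (b1_obs - b2_obs) / np.std(perm_diffs, ddof=1)
```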
joerghenseler
PLS Expert User
Posts: 39
Joined: Fri Oct 14, 2005 9:59 am
Real name and title:

Chow

Post by joerghenseler »

Hi Stefan,

I think what you are looking for is the so-called "Chow test".
You can find information on that e.g. in Gujarati (2003: Basic Econometrics).
I think there was also a paper by Bass and Wittink (JMR, 1975) on the pooling issue.

Regards,

Jörg
bastid
PLS Junior User
Posts: 2
Joined: Mon Aug 07, 2006 4:24 pm
Real name and title:

Post by bastid »

Hi everybody,

Sorry for bringing this thread up again, but I have the same problem as mentioned above and no answer yet: Chin's formula for S_Pooled is different from the one used by Keil et al.

I just found an article on the web (Sanchez-Franco, M. J., Exploring the influence of gender on the web usage via partial least squares, Behaviour & Information Technology, Vol. 25, No. 1, January-February 2006, 19 – 36), where the formula from Keil et al. is used.

Does anyone know which one is correct? As written above, t-values are approx. 6 times higher with Keil's formula.

Thank you in advance

Sebastian Dettmers
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

The version from Chin's website (no. 1 in my earlier post) is definitely the correct one.

Keil et al.'s formula must be a mistake. By the way, their own calculated values in the paper cannot be reproduced if you fill in the numbers in their formula... so much for A-rated journals ;-)

Cheers,
Stefan
Diogenes
PLS Super-Expert
Posts: 899
Joined: Sat Oct 15, 2005 5:13 pm
Real name and title:
Location: São Paulo - BRAZIL
Contact:

Post by Diogenes »

Hi all,

To compare two groups we could include the group variable as a dummy in the model, but as Maruyama (1998, p. 258) said: “Capturing means differences, however, is not the same as asking about similarity processes.”

In LISREL we could constrain the paths or loadings to be the same for two groups, but not in PLS.

So I implemented the suggestion in
viewtopic.php?t=380

I have used this procedure to compare two groups and it looks OK to me (am I being naive?):

1) Run the model with all data and remove the indicators with low convergent validity
2) Split the data in two (one for each group)
3) Run bootstrap for each group and save the MEANS and STD_ERROR of the loadings and paths (the standard deviation from the bootstrap is already the standard error of the mean, so we don't have to divide by n or n-1)
4) Assess the invariance of the measurement model:
For all loadings in the model, compute the t-value for the difference between the loadings -->
--> t = (load1 – load2)/((SE1^2 + SE2^2)^0.5)
Compute the significance (p-value) for each t-value.
(Idea adapted from MARUYAMA, 1998, p. 259)

If none of the differences are significant, that's perfect; but what if some of them are significant?

I have adapted another idea from SHIPLEY (2000, p. 74):
Compute the composite probability for the measurement model using Fisher's test:
C = -2 SUM ln(p)
“If all k independence relationships are true [differences = zero], then this statistic will follow a chi-squared distribution with 2k degrees of freedom.”

t --> p --> ln p --> SUM ln p --> C = -2 SUM ln p --> chi-squared table (df = 2k)
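As a sketch, Fisher's composite test described above could look like this in Python (the p-values would come from the per-loading t-tests; the function name is illustrative):

```python
import math

def fisher_composite(p_values):
    """Fisher's combined probability test: C = -2 * sum(ln p_i).
    Under the joint null (all k differences are zero), C follows a
    chi-squared distribution with df = 2k degrees of freedom."""
    c = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * len(p_values)
    return c, df

# Example: five individually non-significant loading differences
# c, df = fisher_composite([0.40, 0.55, 0.62, 0.30, 0.71])
# then compare c against the chi-squared critical value with df = 10
```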


5) If this is OK, we can proceed to the comparison of the paths.
Same procedure: compute t-values for each difference between the paths (we could finish here, because some paths are indeed different), or

If a few paths have marginally significant differences, we could compute the composite probability to see whether these differences are significant for the model as a whole.


MARUYAMA, Geoffrey M. Basics of Structural Equation Modeling. USA: Sage Publications, Inc., 1998.
SHIPLEY, Bill. Cause and Correlation in Biology: a user’s guide to path analysis, structural equations and causal inference. United Kingdom: Cambridge University Press: 2000.


Please, if someone finds a mistake, I will be happy to know about it.

Bido
bastid
PLS Junior User
Posts: 2
Joined: Mon Aug 07, 2006 4:24 pm
Real name and title:

Post by bastid »

stefanbehrens wrote:Keil et al.'s formula must be a mistake. By the way, their own calculated values in the paper cannot be reproduced if you fill in the numbers in their formula... so much for A-rated journals ;-)

Cheers,
Stefan
Hi Stefan,

Thank you for your quick answer! I thought it was because I have an AMD processor ;)... I was checking my formulas in Excel over and over again...

@ Prof. Diogenes

I just tested your suggestion, but my t-values differ substantially from the ones I calculated with Chin's formula. In fact, the t-values are less sensitive to changes in loadings or SEs with t = (load1 – load2)/((SE1^2 + SE2^2)^0.5).

Regards

Sebastian
schroer
PLS Senior User
Posts: 24
Joined: Fri Nov 04, 2005 5:54 pm
Real name and title: Dr. Joachim Schroer
Contact:

Post by schroer »

Hello,

I was wondering if anyone has used formula 2 suggested by Chin -- or rather, the suggested adjustment of the degrees of freedom in formula 3. Unfortunately, formula 3 on the web site [1] does not seem to work for me, because the adjusted df values come out much too low.

I was under the impression that the formula for the Welch-Satterthwaite adjustment was:

df_corr = (w1 + w2)^2 / [w1^2/(n1-1) + w2^2/(n2-1)]

where

w1 = s1^2/n1 and w2 = s2^2/n2 .


In fact, that formula works reasonably well. However, I was wondering if that usage is correct given Chin's comment (which Christian quoted above) that a correction by sample size is wrong in this case...
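For reference, here is that adjustment as a small Python sketch following the standard textbook form with w_i = s_i^2 / n_i (whether the bootstrap SE should enter this way is exactly the open question above):

```python
def welch_df(s1, s2, n1, n2):
    """Welch-Satterthwaite adjusted degrees of freedom for comparing two
    coefficient estimates with standard deviations s1, s2 from groups of
    sizes n1, n2."""
    w1 = s1 ** 2 / n1
    w2 = s2 ** 2 / n2
    return (w1 + w2) ** 2 / (w1 ** 2 / (n1 - 1) + w2 ** 2 / (n2 - 1))
```

With equal variances and equal group sizes this reduces to the usual pooled df: welch_df(1.0, 1.0, 30, 30) gives 58, i.e. n1 + n2 - 2.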

Thanks and best wishes,

Joachim


[1] http://disc-nt.cba.uh.edu/chin/plsfaq/multigroup.htm
Dr. Joachim Schroer

PRIOTAS GmbH
Hohenzollernring 72
50672 Köln

http://www.priotas.de/
Feedback to progress
sergeja
PLS Junior User
Posts: 4
Joined: Tue Mar 07, 2006 3:01 pm
Real name and title:

Interpretation of t-test of path differences

Post by sergeja »

Dear all,

I used Chin's formula to compare the path coefficients of two groups. My puzzle is how to interpret the results: I obtain some significant paths in one group but not in the other, yet the t-tests of the differences between the parallel paths are in many instances not significant. How can I explain the results of each submodel then?

Your answer will be highly appreciated!

Best wishes. Sergeja
Khim Kelly

Re: Interpretation of t-test of path differences

Post by Khim Kelly »

sergeja wrote:Dear all,

I used Chin's formula to compare the path coefficients of two groups. My puzzle is how to interpret the results: I obtain some significant paths in one group but not in the other, yet the t-tests of the differences between the parallel paths are in many instances not significant. How can I explain the results of each submodel then?

Your answer will be highly appreciated!

Best wishes. Sergeja
I am encountering the same problem as Sergeja. My path coefficient is significant for one group but insignificant for the other group. But Chin's t-test indicates that the difference between the two path coefficients is insignificant. My hypothesis is that the path coefficients are different. So, I am not sure what to conclude. Any help on this would be appreciated.