Multi-group comparisons with PLS

Questions about the implementation and application of the PLS-SEM method that are not related to the use of the SmartPLS software.
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Multi-group comparisons with PLS

Post by stefanbehrens »

Dear PLS-Community,

I would like to compare two subsamples of my data for differences in the structural path estimates. In particular, I would like to test whether the observed differences in the estimates are actually significant.

Dibbern/Chin (2005) describe a procedure using random permutations. However, their description of the algorithm is rather sketchy with regard to how the data are actually permuted. In particular, I would like to understand how the individual random permutations are created when the two subsamples differ in size.

Could anyone explain how this is done?

Also, I'm wondering if anybody has already done this using SmartPLS. Creating the 100+ permutations manually and then estimating 200+ models sounds tedious. Is there an easier way?

Thanks in advance.
Cheers,
Stefan
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

All,

Since nobody has answered my post, I had to find a way to solve this myself ;-). I'm posting the procedure here for anyone who is interested (feedback highly welcome):

A) run the model with the full data set in PLS and save the LV-scores
B) split the file with the LV-scores into two subgroups you would like to compare
C) run a PLS model for each subgroup using the LV-scores as single indicators for each LV and note down the path coefficients
D) import the full sample of LV-scores (from step A) into SPSS (or any other statistical software)
E) run a macro that
- randomly filters a number of cases from your sample without replacement (the same number as there are cases in your first subgroup)
- runs a multiple linear regression analysis and stores the regression coefficients in a file
- inverts the filter (thus selecting the remaining cases from your sample)
- runs a multiple linear regression analysis and stores the regression coefficients in a file (preferably a different file)
- repeats for as many permutations as you want (e.g. 1000)
F) Load both files with the regression coefficients into Excel (2 sheets)
G) Create a third sheet that contains the differences of each coefficient from corresponding lines in the sheets with the regression coefficients (this is now essentially a bootstrap distribution of your difference statistic)
H) Calculate the coefficient differences for your two subsamples (step C)
I) Check how many permutations have resulted in a larger/ smaller difference than those calculated in step H
J) Based on the total number of permutations (e.g. 1000), calculate the percentage of permutations with differences larger/smaller than the observed differences in your subsamples. Percentages below 5% and above 95% indicate differences that are significant at alpha = 0.05.
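For anyone who wants to script this outside SPSS, the core of steps E, I and J can be sketched in Python. This is a rough sketch with illustrative function and variable names (none of them come from the original procedure); it also shows how unequal subsample sizes are handled: each permutation simply draws n1 cases at random for group 1 and assigns the remaining n - n1 cases to group 2.

```python
import numpy as np

def ols_slopes(X, y):
    # OLS slope coefficients (intercept dropped) via least squares
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

def permutation_diffs(X, y, n1, n_perm=1000, seed=0):
    """Step E: repeatedly split the full sample into random subgroups of
    sizes n1 and n - n1 (without replacement) and store the difference in
    regression coefficients for each split."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty((n_perm, X.shape[1]))
    for i in range(n_perm):
        idx = rng.permutation(n)
        g1, g2 = idx[:n1], idx[n1:]
        diffs[i] = ols_slopes(X[g1], y[g1]) - ols_slopes(X[g2], y[g2])
    return diffs

def p_value(observed_diff, diffs, j):
    """Steps I/J: fraction of permuted differences for coefficient j
    that are smaller than the observed difference."""
    return np.mean(diffs[:, j] < observed_diff)
```

With the LV scores as `X` and `y`, an observed difference falling below the 2.5th or above the 97.5th percentile of `diffs` would be significant at alpha = 0.05 (two-sided).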

Comments on this procedure are highly welcome.

Cheers,
Stefan

PS: For those who are interested, I'm posting the SPSS Syntax for the permutation procedure below...



* ============================================================
* = PERMUTATION TESTING OF REGRESSION COEFFICIENT DIFFERENCES
* = SPSS 13.0 SYNTAX 
* =.



* ====  Change Program Settings  ====.

SET MITERATE=10000.
PRESERVE.
SET TVARS NAMES.
OMS /DESTINATION VIEWER=NO /TAG='suppressall'.

* Select regression coefficients tables and write to data file.
OMS /SELECT TABLES
    /IF COMMANDS=['Regression'] SUBTYPES=['Coefficients']
    /DESTINATION FORMAT=SAV OUTFILE='--NAME AND PATH OF OUTPUT FILE--.sav'
    /COLUMNS DIMNAMES=[ 'Variables'  'Statistics']
    /TAG='reg_coeff'.


* ====  Define Macro for Permutations  ====.

DEFINE regression_permutations (permutations=!TOKENS(1)
	/subsample=!TOKENS(1)
	/depvar=!TOKENS(1)
	/indvars=!CMDEND)

COMPUTE dummyvar=1.
AGGREGATE
  /OUTFILE = * MODE = ADDVARIABLES
  /BREAK=dummyvar
  /filesize=N.
!DO !other=1 !TO !permutations
SET SEED RANDOM.
WEIGHT OFF.
FILTER OFF.
DO IF $casenum=1.
- COMPUTE #subsamplesize=!subsample.
- COMPUTE #filesize=filesize.
END IF.
DO IF (#subsamplesize>0 and #filesize>0).
- COMPUTE filter_$ = uniform(1)* #filesize < #subsamplesize.
- COMPUTE #subsamplesize = #subsamplesize - filter_$.
- COMPUTE #filesize = #filesize - 1.
ELSE.
- COMPUTE filter_$ = 0.
END IF.
EXECUTE.
FILTER BY filter_$.
REGRESSION
  /STATISTICS COEFF
  /DEPENDENT !depvar
  /METHOD=ENTER !indvars.
RECODE filter_$  (1=0)  (0=1).
EXECUTE .
REGRESSION
  /STATISTICS COEFF
  /DEPENDENT !depvar
  /METHOD=ENTER !indvars.
!DOEND

!ENDDEFINE.


* ====  Get Input File for Regression  ====.

GET FILE='--NAME AND PATH OF INPUT FILE--.sav'.


* ====  Call Permutation Macro  ====.

regression_permutations
   permutations=1000
   subsample=50
   depvar= dependentvariable
   indvars= independentvariable1 independentvariable2 independentvariable3 .


* ====  Normalize Settings and Open Results  ====.

OMSEND.
RESTORE.
GET FILE '--NAME AND PATH OF OUTPUT FILE--.sav'.

Last edited by stefanbehrens on Tue Mar 07, 2006 9:10 am, edited 1 time in total.
joerghenseler
PLS Expert User
Posts: 39
Joined: Fri Oct 14, 2005 9:59 am
Real name and title:

Group comparisons

Post by joerghenseler »

Hi Stefan,

In principle, your workaround is okay.
However, I would suggest another approach:
I recommend having a look at the literature on modeling interaction effects in linear regressions, mainly the small green booklet by Jaccard and Turrisi.
They basically say that you can treat the grouping variable (which must be in dummy form) like a moderating variable.

The bootstrap results for the interaction term give you an answer to the question of whether there are significant group differences.

Best regards,

Jörg
cringle
SmartPLS Developer
Posts: 818
Joined: Tue Sep 20, 2005 9:13 am
Real name and title: Prof. Dr. Christian M. Ringle
Location: Hamburg (Germany)
Contact:

Post by cringle »

Great effort, Stefan, this looks good!

Well, you are right about the paper you mentioned as well… However, for pragmatic reasons Chin suggests running bootstrap re-sampling for the various groups and treating the standard error estimates from each resampling in a parametric sense via t-tests – see http://disc-nt.cba.uh.edu/chin/plsfaq.htm. The weaknesses of such an approach are also discussed there.

Stefan, would it be useful to check whether the two groups are different compared to the estimates for the overall sample? (That's at least something I always look at when analyzing certain groups.) In this case, you could just run several hundred bootstrap samples (with n = number of cases in the group of interest) and then analyze whether the group-specific estimates are significantly different from the randomly created groups from the overall sample (the path estimates for each bootstrap run can easily be extracted from the SmartPLS report and then be checked as you stated before).

Best
Christian
Last edited by cringle on Tue Mar 07, 2006 9:22 am, edited 1 time in total.
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

Jörg/ Christian,

Thank you both for your helpful comments. Chin's suggested approach of using bootstrap parameters in a parametric sense is actually what I tried first. However, his formula gave me very high t-values even for the smallest coefficient differences (e.g., t = 18.3 for a path difference of 0.04, which simply can't be right). Hence, I'm wondering if I used the formula correctly. In particular, I'm unsure about the calculation of the standard errors. Since SmartPLS does not provide this statistic in the bootstrap report, I calculated the SE for a bootstrap with N = 1000 runs as follows: SE = StDev/SQRT(N). Is this correct?

Since all the other parts of his formula are very straightforward, I don't see where else I could possibly have gone wrong. Playing around with the N in the above calculation, I found that if I used the subgroup size (42) instead of the number of bootstrap runs (1000), the resulting t-values were much more realistic. However, I can't see the theoretical rationale for using the 42 instead of the 1000. Could you shed some light on this?

Many thanks in advance!

All the best,
Stefan
cringle
SmartPLS Developer
Posts: 818
Joined: Tue Sep 20, 2005 9:13 am
Real name and title: Prof. Dr. Christian M. Ringle
Location: Hamburg (Germany)
Contact:

Post by cringle »

Hi Stefan,

the bootstrap SE is equal to the std. dev.

See also Chin's note:
"The formula may differ from standard texts contrasting regression coefficients. The reason for the difference is the use of the bootstrapped standard error. This standard error is already mean adjusted reflecting the standard deviation of the sampling distribution as opposed to the sample or population standard deviation. The formula in many books assume the latter and goes on to adjust by the sample size. So, we need to correct for it by multiplying the SE from the bootstrap by the square root of the sample as well. In other words, (n-1)*square (SE from bootstrap) represents the variance of the sample."

Best
Christian
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

Hi Christian,

Thank you for pointing this out (SE = bootstrap StDev). This certainly explains why the formula didn't work earlier. However, I'm still not sure how to apply the formula properly.

In particular, I have found two different versions of the formula:

Version 1 (from Chin's homepage):
S_Pooled = SQRT{[(n1-1)^2/(n1+n2-2)]*SE1^2 + [(n2-1)^2/(n1+n2-2)]*SE2^2}
t = (b1-b2)/[S_Pooled*SQRT(1/n1+1/n2)]

Version 2 (in Keil et al. 2000, MISQ 24(2), p.315):
S_Pooled = SQRT{[(n1-1)/(n1+n2-2)]*SE1^2 + [(n2-1)/(n1+n2-2)]*SE2^2}
t = (b1-b2)/[S_Pooled*SQRT(1/n1+1/n2)]

where:
- n1, n2 = number of cases in subgroup 1 and 2
- b1, b2 = path coefficients in model 1 and 2
- SE1, SE2 = standard deviation of b1 and b2 (using bootstrap)

Notice the missing "squaring" of the (n1-1) and (n2-1) terms in the calculation of S_Pooled for version 2.

Obviously, the resulting t-values are vastly different between version 1 and version 2 (by a factor of ~6 in my case).
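To make the discrepancy concrete, here is a small Python sketch with made-up inputs (the b, SE, and n values below are purely illustrative, not from my data); the only difference between the two versions is whether the (n-1) weights in S_Pooled are squared:

```python
from math import sqrt

def t_pooled(b1, b2, se1, se2, n1, n2, squared):
    """Pooled-SE t-statistic for a path difference.
    squared=True  -> Version 1 (Chin's website), weights (n-1)^2
    squared=False -> Version 2 (Keil et al.),    weights (n-1)"""
    p = 2 if squared else 1
    s_pooled = sqrt((n1 - 1) ** p / (n1 + n2 - 2) * se1 ** 2
                    + (n2 - 1) ** p / (n1 + n2 - 2) * se2 ** 2)
    return (b1 - b2) / (s_pooled * sqrt(1 / n1 + 1 / n2))

# Illustrative (made-up) inputs: b1=0.45, b2=0.20, SE1=0.10, SE2=0.12
t1 = t_pooled(0.45, 0.20, 0.10, 0.12, 42, 58, squared=True)   # Version 1
t2 = t_pooled(0.45, 0.20, 0.10, 0.12, 42, 58, squared=False)  # Version 2
# t2 comes out several times larger than t1, consistent with the
# factor of ~6 mentioned above
```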

I tend to accept Version 1 as correct; would you agree? If yes, do you have a reference for the correct formula besides Chin's website?

Many thanks in advance,
Stefan

PS: I have compared the results of Version 1 with the results of my permutation test procedure to cross-validate. In fact, the t-values calculated with Version 1 of the formula are very close to the t-values calculated using the permutation data:

t_perm = (b1_obs-b2_obs)/stdev(b1_perm-b2_perm)

where:
- b1_obs, b2_obs are the observed coefficients for subgroups 1 and 2
- b1_perm, b2_perm are the coefficients calculated in the 1000 permutation runs
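In code form, this cross-validation statistic is simply the observed difference scaled by the spread of the permutation distribution (a hypothetical sketch; `perm_diffs` would hold the b1_perm - b2_perm values from the permutation runs):

```python
import numpy as np

def t_perm(b1_obs, b2_obs, perm_diffs):
    """t-like statistic: observed coefficient difference divided by the
    standard deviation of the permuted coefficient differences."""
    return (b1_obs - b2_obs) / np.std(perm_diffs, ddof=1)
```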
joerghenseler
PLS Expert User
Posts: 39
Joined: Fri Oct 14, 2005 9:59 am
Real name and title:

Chow

Post by joerghenseler »

Hi Stefan,

I think what you are looking for is the so-called "Chow test".
You can find information on that e.g. in Gujarati (2003: Basic Econometrics).
I think there was also a paper by Bass and Wittink (JMR, 1975) on the pooling issue.

Regards,

Jörg
bastid
PLS Junior User
Posts: 2
Joined: Mon Aug 07, 2006 4:24 pm
Real name and title:

Post by bastid »

Hi everybody,

Sorry for bringing this thread up again, but I have the same problem as mentioned above and no answer yet: Chin's formula for S_Pooled is different from the one used by Keil et al.

I just found an article on the web (Sanchez-Franco, M. J., Exploring the influence of gender on the web usage via partial least squares, Behaviour & Information Technology, Vol. 25, No. 1, January-February 2006, 19 – 36), where the formula from Keil et al. is used.

Does anyone know which one is correct? As written above, t-values are approx. 6 times higher with Keil's formula.

Thank you in advance

Sebastian Dettmers
stefanbehrens
PLS Expert User
Posts: 54
Joined: Wed Oct 19, 2005 5:53 pm
Real name and title:

Post by stefanbehrens »

The version from Chin's website (no. 1 in my earlier post) is definitely the correct one.

Keil et al.'s formula must be a mistake. By the way, their own calculated values in the paper cannot be reproduced if you fill in the numbers in their formula... so much for A-rated journals ;-)

Cheers,
Stefan
Diogenes
PLS Super-Expert
Posts: 899
Joined: Sat Oct 15, 2005 5:13 pm
Real name and title:
Location: São Paulo - BRAZIL
Contact:

Post by Diogenes »

Hi all,

To compare two groups we could include the group variable as a dummy in the model, but as Maruyama (1998, p. 258) said: “Capturing means differences, however, is not the same as asking about similarity processes.”

In LISREL we could constrain the paths or loadings to be the same for two groups, but not in PLS.

So I implemented the suggestion in
viewtopic.php?t=380

I have used this procedure to compare two groups and it looks OK to me (am I being naive?):

1) Run the model with all data and remove the indicators with low convergent validity
2) Split the data in two (one for each group)
3) Run bootstrap for each group and save the MEANS and STD_ERROR of the loadings and paths (the standard deviation from the bootstrap is already the standard error of the mean, so we don't have to divide by n or n-1)
4) Assess the invariance of the measurement model:
For all loadings in the model, compute the t-value for the difference between the loadings -->
--> t = (load1 – load2)/((SE1^2 + SE2^2)^0.5)
Compute the significance (p-value) for each t-value.
(Idea adapted from MARUYAMA, 1998, p. 259)

If none of the differences are significant, that's perfect; but what if some of them are significant?

I have adapted another idea from SHIPLEY (2000, p. 74):
Compute the composite probability for the measurement model using Fisher's test:
C = -2 SUM ln(p)
“If all k independence relationships are true [differences = zero], then this statistic will follow a chi-squared distribution with 2k degrees of freedom.”

t --> p --> ln p --> SUM ln p --> C = -2 SUM ln p --> chi-squared table (df = 2k)
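As a sketch, Fisher's composite test described above could look like this in Python (the p-values would come from the per-loading t-tests; the function name is illustrative):

```python
import math

def fisher_composite(p_values):
    """Fisher's combined probability test: C = -2 * sum(ln p_i).
    Under the joint null (all k differences are zero), C follows a
    chi-squared distribution with df = 2k degrees of freedom."""
    c = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * len(p_values)
    return c, df

# Example: five individually non-significant loading differences
# c, df = fisher_composite([0.40, 0.55, 0.62, 0.30, 0.71])
# then compare c against the chi-squared critical value with df = 10
```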


5) If this is OK, we can proceed to the comparison of the paths.
Same procedure: compute t-values for each difference between the paths (we could finish here, because some paths are indeed different), or

If a few paths have marginally significant differences, we could compute the composite probability to see whether these differences are significant for the model as a whole.


MARUYAMA, Geoffrey M. Basics of Structural Equation Modeling. USA: Sage Publications, Inc., 1998.
SHIPLEY, Bill. Cause and Correlation in Biology: a user’s guide to path analysis, structural equations and causal inference. United Kingdom: Cambridge University Press: 2000.


Please, if someone finds a mistake, I will be happy to know about it.

Bido
bastid
PLS Junior User
Posts: 2
Joined: Mon Aug 07, 2006 4:24 pm
Real name and title:

Post by bastid »

stefanbehrens wrote:Keil et al.'s formula must be a mistake. By the way, their own calculated values in the paper cannot be reproduced if you fill in the numbers in their formula... so much for A-rated journals ;-)

Cheers,
Stefan
Hi Stefan,

Thank you for your quick answer! I thought it was because I have an AMD processor ;)... I was checking my formulas in Excel over and over again...

@ Prof. Diogenes

I just tested your suggestion, but my t-values differ substantially from the ones I calculated with Chin's formula. In fact, the t-values are less sensitive to changes in loadings or SEs with t = (load1 – load2)/((SE1^2 + SE2^2)^0.5).

Regards

Sebastian
schroer
PLS Senior User
Posts: 24
Joined: Fri Nov 04, 2005 5:54 pm
Real name and title: Dr. Joachim Schroer
Contact:

Post by schroer »

Hello,

I was wondering if anyone has used formula 2 suggested by Chin -- or rather, the suggested adjustment of the degrees of freedom in formula 3. Unfortunately, formula 3 on the web site [1] does not seem to work for me, because the adjusted df values come out much too low.

I was under the impression that the formula for the Welch-Satterthwaite adjustment was:

df_corr = (w1 + w2)^2 / [w1^2/(n1-1) + w2^2/(n2-1)]

where

w1 = s1^2/n1 and w2 = s2^2/n2 .


In fact, that formula works reasonably well. However, I was wondering if that usage is correct given Chin's comment (which Christian quoted above) that a correction by sample size is wrong in this case...
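For reference, here is that adjustment as a small Python sketch following the standard textbook form with w_i = s_i^2 / n_i (whether the bootstrap SE should enter this way is exactly the open question above):

```python
def welch_df(s1, s2, n1, n2):
    """Welch-Satterthwaite adjusted degrees of freedom for comparing two
    coefficient estimates with standard deviations s1, s2 from groups of
    sizes n1, n2."""
    w1 = s1 ** 2 / n1
    w2 = s2 ** 2 / n2
    return (w1 + w2) ** 2 / (w1 ** 2 / (n1 - 1) + w2 ** 2 / (n2 - 1))
```

With equal variances and equal group sizes this reduces to the usual pooled df: welch_df(1.0, 1.0, 30, 30) gives 58, i.e. n1 + n2 - 2.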

Thanks and best wishes,

Joachim


[1] http://disc-nt.cba.uh.edu/chin/plsfaq/multigroup.htm
Dr. Joachim Schroer

PRIOTAS GmbH
Hohenzollernring 72
50672 Köln

http://www.priotas.de/
Feedback to progress
sergeja
PLS Junior User
Posts: 4
Joined: Tue Mar 07, 2006 3:01 pm
Real name and title:

Interpretation of t-test of path differences

Post by sergeja »

Dear all,

I used Chin's formula to compare the path coefficients of two groups. My puzzle is how to interpret the results: I obtain some significant paths in one group but not in the other, yet the t-tests of the differences between the parallel paths are in many instances not significant. How can I explain the results of each submodel then?

Your answer will be highly appreciated!

Best wishes. Sergeja
Khim Kelly

Re: Interpretation of t-test of path differences

Post by Khim Kelly »

sergeja wrote:Dear all,

I used Chin's formula to compare the path coefficients of two groups. My puzzle is how to interpret the results: I obtain some significant paths in one group but not in the other, yet the t-tests of the differences between the parallel paths are in many instances not significant. How can I explain the results of each submodel then?

Your answer will be highly appreciated!

Best wishes. Sergeja
I am encountering the same problem as Sergeja. My path coefficient is significant for one group but insignificant for the other group. But Chin's t-test indicates that the difference between the two path coefficients is insignificant. My hypothesis is that the path coefficients are different. So, I am not sure what to conclude. Any help on this would be appreciated.