Oops: Bootstrapping for Dummies

Thomas Kley · Post by **Thomas Kley** » Fri May 25, 2007 4:21 pm

Hello, experienced PLS-Users,

unfortunately, this is the first time I have applied the bootstrapping routine (in SmartPLS) to evaluate the total effects of my model.

Default settings are "Cases=100" and "Sample=200" (?!): Do I have to stick with these defaults, or do I have to adjust them, e.g. set the number of "cases" to match my data?

Could anyone be so kind as to provide some information on "Bootstrapping for Dummies"? Thanks in advance,

Tom

Diogenes · Post by **Diogenes** » Wed May 30, 2007 3:43 am

Hi,

Hints:
1) Use "search" function in this forum to learn how to use SmartPLS
2) See
viewtopic.php?t=273&highlight=bootstrapping
viewtopic.php?t=151&highlight=bootstrapping

cases = number of cases (or observations) in your collected sample
sample = number of times a new sample is drawn (with replacement) from the raw data and the model parameters are re-estimated - usually 500 or 1000
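
As a rough illustration of these two settings, here is a minimal sketch in Python with NumPy (not SmartPLS itself). The function name `bootstrap_estimates`, its parameters and the simulated data are assumptions made for this example; `n_cases` plays the role of "cases" and `n_resamples` the role of "sample".

```python
import numpy as np

def bootstrap_estimates(data, statistic, n_resamples=500, n_cases=None, seed=0):
    """Return one estimate of `statistic` per bootstrap resample."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    # "cases": rows drawn per resample; defaults to the original sample size.
    n_cases = len(data) if n_cases is None else n_cases
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):  # "sample": number of resamples
        idx = rng.integers(0, len(data), size=n_cases)  # draw with replacement
        estimates[b] = statistic(data[idx])
    return estimates

# Hypothetical data: 200 observations of a single variable.
rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

boot = bootstrap_estimates(sample, np.mean, n_resamples=500)
se_boot = boot.std(ddof=1)  # spread of the resampled estimates = bootstrap SE
print(f"estimate={sample.mean():.3f}  bootstrap SE={se_boot:.3f}")
```

The standard deviation of the resampled estimates is what gets used as the standard error in the t-value.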

Best regards.

Bido

viswadatta · Post by **viswadatta** » Fri Jun 01, 2007 4:45 pm

Let us say that your sample size is 200. In the bootstrap you need to select a smaller number (say 50). Now you decide on the number of resamples (say 500).
What you do now is select a sample of 50 from your data and compute the mean for each variable. This is the first sample. You repeat this process 500 times. What you get is a mean of means, and as per the central limit theorem, the means are normally distributed about the mean of means (unlike the raw data). Now you can compute the t-statistic.

Make sure that the number of cases selected per bootstrap sample is smaller than your total data set.

Diogenes · Post by **Diogenes** » Sat Jun 02, 2007 3:34 am

Hi professor Vivek,

I disagree with you, but I will try to explain why the number of cases should be equal to the number in the original data.

First, see some examples:
--------------------------------------------------------------------------------
EFRON, Bradley. The bootstrap and modern statistics. Journal of the American Statistical Association; Dec 2000; v. 95, n. 452; pg. 1293

“Each ‘theta’ was calculated by drawing 20 points at random, with replacement, from the 20 actual data points in the left panel, and then computing the Pearson sample correlation coefficient for this bootstrap data set.”
-----------------------------------------------------------------------

HESTERBERG, Tim et al. Bootstrap methods and permutation tests – Companion chapter 18 to The Practice of Business Statistics. New York, W. H. Freeman and Company, 2003. Available at http://bcs.whfreeman.com/pbs/cat_160/PBS18.pdf

pg. 18-7 ==> Procedure for bootstrapping
Step 1: Resample. Create hundreds of new samples, called bootstrap samples or resamples, by sampling with replacement from the original random sample. Each sample is the same size as the original random sample.
--------------------------------------------------------------------

ANDREWS; Donald W. K.; BUCHINSKY, Moshe. Three-Step Method for Choosing the Number of Bootstrap Repetitions. Econometrica, Vol. 68, No. 1. (Jan., 2000), pp. 23-51.

pg. 27 ==> “The observed data are a sample of size n ..... Let [...] be a bootstrap sample of size n based on the original sample..”
------------------------------------------------------------------------------

The consequences of using a smaller number of cases in the bootstrap than in the original data set can be seen at http://www.stata.com/support/faqs/stat/reps.html
However, the standard error estimates are dependent upon the number of observations in each replication (as Samy has reminded us).
----------------------------------------------------------------------

Finally,

viewtopic.php?t=151&highlight=bootstrap
As professor Stefan said:
“I would even argue that small bootstrap sample sizes tend to produce greater variance in the parameter estimates of the individual bootstrap runs and thus "deflate" your t-values.”

And in the next box he continues:
“...ALWAYS using the original N as the bootstrap sample size.”

Best regards

Bido

viswadatta · Post by **viswadatta** » Mon Jun 04, 2007 11:57 am

Thanks for the detailed explanation, but in that case why can the bootstrap sample size not be greater than the original sample size? Is the sample size not relatively unimportant compared to the number of resamples?

Is there any specific rule stating that the bootstrap sample size should be exactly the data set size?

Diogenes · Post by **Diogenes** » Mon Jun 04, 2007 3:07 pm

Hi,

the justification is that, when each resample has the same n as the original sample,
SE_boot ≈ s / sqrt(n)
i.e. the standard error from the bootstrap ≈ the standard error given by the usual formula
(HESTERBERG, 2003)

The standard error estimates are dependent upon the number of observations in each replication (GOULD; PITBLADO, 2005 - http://www.stata.com/support/faqs/stat/reps.html). They present a didactic example.
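
A small numerical check of this point, sketched in Python with NumPy for the simplest case (the mean of one variable). The data are simulated and the helper `boot_se_of_mean` is a hypothetical name for this example; it is not taken from any of the cited sources.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)      # hypothetical original sample of n cases
s = x.std(ddof=1)           # sample standard deviation

def boot_se_of_mean(x, m, n_resamples=4000, rng=rng):
    """Bootstrap SE of the mean when each resample holds m cases."""
    # Each row of idx is one resample of m case indices, drawn with replacement.
    idx = rng.integers(0, len(x), size=(n_resamples, m))
    return x[idx].mean(axis=1).std(ddof=1)

print(f"usual formula s/sqrt(n) = {s / np.sqrt(n):.4f}")
for m in (25, 50, 100):
    se = boot_se_of_mean(x, m)
    print(f"m={m:3d}  SE_boot={se:.4f}  s/sqrt(m)={s / np.sqrt(m):.4f}")
```

Only m = n reproduces the usual standard error of the original sample; a smaller m inflates it and a larger m shrinks it, because SE_boot follows s / sqrt(m).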

But I will look for a new didactical example for us.

Best regards.

Bido

viswadatta · Post by **viswadatta** » Wed Jun 06, 2007 12:14 pm

Thanks prof

In any case each bootstrap will give different values, but what about bootstraps with a size larger than the sample size?

Will the path significances change when the bootstrap sample size varies?

Diogenes · Post by **Diogenes** » Fri Jun 08, 2007 7:16 pm

Hi,

To have an example to show, I ran a simulation as follows:
1) I generated five variables (50 cases) with these correlations: r(x,y1)=0.5; r(x,y2)=0.6; r(x,y3)=0.7; r(x,y4)=0.8.
2) The sample correlations were: 0.575; 0.662; 0.748 and 0.831
3) To use the bootstrap procedure in SmartPLS, I modeled X as an exogenous variable connected to the other four variables at the same time.
4) 24 runs were made with:
• n = 25, 50, 75 and 500 (cases)
• B = 100, 200, 500, 1000, 2000 and 5000 (resamples)

5) Conclusions about the MEAN: In all runs (regardless of the number of cases in each resample), when B >= 200 resamples the bootstrap means were close to the sample values.
For r=0.5, the results were (the other results followed the same pattern):

| B | n=25 | n=50 | n=75 | n=500 |
|---|---|---|---|---|
| B=100 | 0.5506 | 0.5725 | 0.5621 | 0.5736 |
| B=200 | 0.5722 | 0.5795 | 0.5738 | 0.5765 |
| B=500 | 0.5758 | 0.5737 | 0.5712 | 0.5760 |
| B=1000 | 0.5639 | 0.5729 | 0.5715 | 0.5752 |
| B=2000 | 0.5701 | 0.5699 | 0.5677 | 0.5738 |
| B=5000 | 0.5669 | 0.5708 | 0.5734 | 0.5738 |
| Original | 0.5745 | 0.5745 | 0.5745 | 0.5745 |

6) Conclusions about the STD ERROR (and t-values): In all runs:
• The value of the standard error hardly changed with B;
• The value of the standard error depends on n (as shown by GOULD; PITBLADO, 2005 - http://www.stata.com/support/faqs/stat/reps.html)
• A smaller n gave a bigger std error, and therefore smaller t-values
• A bigger n gave a smaller std error, and therefore bigger t-values

For r=0.5, the results were:

| B | n=25 | n=50 | n=75 | n=500 |
|---|---|---|---|---|
| B=100 | 0.1397 | 0.0897 | 0.0748 | 0.0282 |
| B=200 | 0.1402 | 0.1002 | 0.0742 | 0.0289 |
| B=500 | 0.1296 | 0.0908 | 0.0739 | 0.0281 |
| B=1000 | 0.1374 | 0.0860 | 0.0779 | 0.0281 |
| B=2000 | 0.1333 | 0.0935 | 0.0718 | 0.0287 |
| B=5000 | 0.1352 | 0.0906 | 0.0744 | 0.0284 |

Finally:
For n=50 and B=5000 (same pattern for the other combinations), we can see that a bigger r has a smaller std error:
r=0.5 (0.0906); r=0.6 (0.0743); r=0.7 (0.0569) and r=0.8 (0.0385)
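
For readers who want to try a similar experiment outside SmartPLS, here is a minimal sketch in Python with NumPy. It bootstraps a plain Pearson correlation rather than a PLS path, the data are freshly simulated (so the numbers will not match the tables above), and the names `row_corr` and `boot_corr` are hypothetical helpers for this example.

```python
import numpy as np

rng = np.random.default_rng(7)
N, rho = 50, 0.5
# Hypothetical data: two variables with population correlation rho, 50 cases.
x = rng.normal(size=N)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=N)
r_sample = np.corrcoef(x, y)[0, 1]

def row_corr(a, b):
    """Pearson correlation computed separately for each row of two 2-D arrays."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a**2).sum(axis=1) * (b**2).sum(axis=1))

def boot_corr(n, B):
    """Mean and std error of r over B resamples of n cases each."""
    idx = rng.integers(0, N, size=(B, n))  # case indices, drawn with replacement
    r = row_corr(x[idx], y[idx])
    return r.mean(), r.std(ddof=1)

results = {}
print(f"sample r = {r_sample:.4f}")
for B in (100, 1000, 5000):
    for n in (25, 50, 75, 500):
        results[(n, B)] = boot_corr(n, B)
        mean_r, se_r = results[(n, B)]
        print(f"B={B:5d}  n={n:3d}  mean={mean_r:.4f}  SE={se_r:.4f}")
```

The expected pattern is the one described above: the mean stays near the sample value for every n, while the standard error moves with n and barely with B.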

7) Two curious notes:
7.1) Consistent with the recommendation that n = n of the original sample:
The Resampling software has no option to define the n value (it assumes that n of the resample = n of the original sample).
HOWELL, D. C. Software Resampling Procedures. Version 1.3. University of Vermont, Department of Psychology, 2001. Available at: http://www.uvm.edu/~dhowell/StatPages/R ... pling.html

7.2) A “strong reference”, but with a different recommendation!?
In the PRELIS guide, page 184 reads: “The original data consists of N cases and we want to draw K samples of size n. The drawing is done with replacement. The number n may be smaller than, equal to, or larger than N”
JÖRESKOG, K. G.; SÖRBOM, D. PRELIS 2: user’s reference guide. A program for multivariate data screening and data summarization; a preprocessor for LISREL. SSI – Scientific Software International, 2002.

This flexibility is a useful feature of the software, but we have to remember that n = N (i.e. n_boot = 100% of n_original) is the right recommendation for obtaining t-values, significance levels or confidence intervals (C.I.) from the bootstrap procedure.

Best regards

Bido

viswadatta · Post by **viswadatta** » Sat Jun 09, 2007 10:07 am

Thanks for the detailed explanation, prof, it's great work. I suppose it is a grey area in research, but convention demands that the bootstrap sample size is the same as the original sample size.
Thanks for all the references by the way.

Diogenes · Post by **Diogenes** » Wed Apr 09, 2008 2:49 pm

Hi,
you should use the number of cases (391) in the power analysis.
the samples are the number of repetitions (ideally infinite).
Best regards.
Bido

Diogenes · Post by **Diogenes** » Wed Apr 09, 2008 9:17 pm

Hi,

Just one model) H0: path = 0 --> "mean difference from a constant".

Comparing two models) H0: path1 = path2 --> "two independent samples"
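
The two hypotheses can be sketched as t-values built from bootstrap standard errors. This is only an illustration in Python: the function names are hypothetical, the two-model formula assumes independent estimates with an unpooled standard error, the p-value uses a normal approximation, and the second model's numbers are invented for the example.

```python
import math

def t_single(path, se):
    """t-value for H0: path = 0, given the bootstrap standard error."""
    return path / se

def t_two_models(path1, se1, path2, se2):
    """t-value for H0: path1 = path2, treating the estimates as independent
    (unpooled standard error; an assumption of this sketch)."""
    return (path1 - path2) / math.sqrt(se1**2 + se2**2)

def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Path and SE from the simulation above (r=0.5, n=50, B=5000).
t_one = t_single(0.5745, 0.0906)
# The second model (0.40, 0.10) is purely hypothetical.
t_two = t_two_models(0.5745, 0.0906, 0.40, 0.10)

print(f"one model:  t={t_one:.2f}  p={two_sided_p_normal(t_one):.4g}")
print(f"two models: t={t_two:.2f}  p={two_sided_p_normal(t_two):.4g}")
```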

Best regards.
Bido