Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I am using the following code:

          proc surveyselect data=tmp method=urs sampsize=500 seed=100 out=out_tmp;
          run;

However, when I look at the log, the output data set has only 491 records. My tmp data set has 30,000 records. I need help understanding why 9 records are being dropped. I tried changing the seed value and get roughly 470 to 495 records depending on the seed, but never exactly 500. According to the documentation, the URS option means "unrestricted random sampling, which is selection with equal probability and with replacement". The equal probability part should not matter here. As for "with replacement", I understand it to mean that a record can appear more than once, which is what I am aiming for.

What I do not understand is why the drawn sample stops at a number lower than the 500 I specified.

Thanks for the help.

Question from: https://stackoverflow.com/questions/65892152/proc-surveryselect-sample-defined-verus-sample-received


1 Answer

The issue is that you're not quite understanding how URS reports its sample; I recommend a look through the documentation.

Take this (extreme) example:

          proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100;
          run;

 NOTE: The sample size, 10000, is greater than the number of sampling units, 428.
 NOTE: The data set WORK.SAMPLE_CARS has 428 observations and 16 variables.
 NOTE: PROCEDURE SURVEYSELECT used (Total process time):
       real time           0.02 seconds
       cpu time            0.03 seconds

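Here I ask for a sample of 10,000 (out of 428 total records!) and get... 428 records. By default, the output data set has one row per distinct unit selected, not one row per draw. The important detail to pay attention to is the NumberHits variable, which records how many times each record was sampled.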
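One way to convince yourself of this is to add up NumberHits in the output. The following is a minimal sketch, assuming the sample_cars data set from the run above; the column aliases rows_output and total_hits are names I made up for illustration.

```sas
/* Same URS draw as above: 10,000 selections from the 428 rows of sashelp.cars */
proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100;
run;

/* Count the output rows and total the hits.                       */
/* rows_output and total_hits are illustrative alias names only.   */
proc sql;
  select count(*)        as rows_output,
         sum(NumberHits) as total_hits
  from sample_cars;
quit;
```

The row count should match the 428 observations reported in the log, while the sum of NumberHits should equal the requested sample size of 10,000. The draws are all there; they are just collapsed into one row per unit.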

If you want one record output for each hit, meaning you want those duplicates, you can add the OUTHITS option to your PROC SURVEYSELECT statement. From the documentation on URS:

For unrestricted random sampling, by default, the output data set contains a single copy of each unit selected, even when a unit is selected more than once, and the variable NumberHits records the number of hits (selections) for each unit. If you specify the OUTHITS option, the output data set contains m copies of a sampling unit for which NumberHits is m; for example, the output data set contains three copies of a sampling unit that is selected three times (NumberHits is three). For information about the contents of the output data set, see the section Sample Output Data Set.

Here is my example modified to do just that.

          proc surveyselect data=sashelp.cars method=urs out=sample_cars sampsize=10000 seed=100 outhits;
          run;
 
 NOTE: The sample size, 10000, is greater than the number of sampling units, 428.
 NOTE: The data set WORK.SAMPLE_CARS has 10000 observations and 16 variables.
 NOTE: PROCEDURE SURVEYSELECT used (Total process time):
       real time           0.01 seconds
       cpu time            0.01 seconds
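Applied to the code in the question, the change would look roughly like this. It is a sketch that reuses the asker's data set names (tmp, out_tmp) and assumes nothing else about the data.

```sas
/* The question's call with OUTHITS added, so each hit becomes its own row */
proc surveyselect data=tmp method=urs sampsize=500 seed=100
                  out=out_tmp outhits;
run;
```

With OUTHITS, out_tmp should contain 500 observations, with repeated units appearing as duplicate rows. As a rough sanity check on the original result: assuming 500 independent draws with replacement from all 30,000 rows, the expected number of distinct units is 30000 * (1 - (1 - 1/30000)**500), which is about 496, so seeing slightly fewer than 500 rows without OUTHITS is what you would expect.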
