SLID: Working Experience, Ontario vs. Quebec

In previous examples, you investigated working experience in the province of Ontario alone. Now let's compare Ontario with Quebec. The SLID library contains two subsets of SLID, PONTARIO and PQUEBEC, which hold the data from the provinces of Ontario and Quebec respectively. Before beginning the analysis, we make the SLID value formats available with the OPTIONS statement.

The SAS statements below are contained in the file pontque.sas.

      options fmtsearch=(slid);      /* lpvalue permanent format lib */

Next, we create a new dataset called PONTQUE, which combines Ontario and Quebec. It is built by stacking PONTARIO and PQUEBEC with a single SET statement. The variable REGRE25C distinguishes the two provinces.

   /* Create a combined dataset that includes both the Ontario and Quebec samples */
      data pontque;
         set slid.pontario   slid.pquebec;
      run;
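
As a quick check on the stacking step, a one-way frequency table of REGRE25C should show two levels, one per province, with counts that match the sizes of the two input datasets. This is only a sketch; it assumes the dataset PONTQUE created above and that the SLID formats are reachable through FMTSEARCH.

```sas
   /* Sketch: count the observations contributed by each province */
      proc freq data=pontque;
         tables regre25c;
      run;
```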

Choosing variables and making a working dataset

The original Ontario and Quebec datasets each contain 48 variables. We keep only the variables of interest in PONTQUE, which saves some processing time. You could store the reduced dataset under a new name, but since we do not need the other variables at the moment, we simply overwrite PONTQUE with just the 10 chosen variables.

   /* Choose variables for the merged sample */
      data pontque;
         set pontque      (keep=
      pupid26c      /* Random person ID 1994 */
      
      motn2g15      /* Mother tongue group 2 */
      immst15       /* immigrant */
      yrxft11c      /* Years of work experience 1994 */
      eage26c       /* Ext person's age 1994 */
      sex21         /* Sex */
      regre25c      /* Region 1994 */
      yrsch18c      /* Total yrs of schooling 1994 */
      ttwgs28c      /* Wages and salaries all job 1994 */
      );

      id=put(pupid26c,8.);
      run;
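
The stacking and the variable selection can also be done in a single DATA step by putting KEEP= on each input dataset, so that the unwanted variables are never read. The sketch below assumes the same variable list as above; the macro variable name KEEPVARS and the dataset name PONTQUE1 are made up for illustration and are not part of pontque.sas.

```sas
   /* Sketch: stack and subset in one pass (KEEPVARS and PONTQUE1 are hypothetical names) */
      %let keepvars = pupid26c motn2g15 immst15 yrxft11c eage26c
                      sex21 regre25c yrsch18c ttwgs28c;

      data pontque1;
         set slid.pontario (keep=&keepvars)
             slid.pquebec  (keep=&keepvars);
         id = put(pupid26c, 8.);   /* character copy of the person ID */
      run;
```

Writing to a new name leaves PONTQUE untouched, so the two versions can be compared with PROC CONTENTS before deciding which to keep.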

Summary statistics

The next step is to look at descriptive statistics for PONTQUE. First, a TITLE statement sets a title that appears at the top of every output page. Then PROC CONTENTS lists the attributes of the dataset; the POSITION option additionally lists the variables in the order in which they are stored. The listing also shows each variable's type, label and format, which helps you tell the quantitative variables from the categorical ones. PROC MEANS reports the mean and related statistics for the quantitative variables: age (EAGE26C), wages and salaries (TTWGS28C), total years of schooling (YRSCH18C), and the criterion variable, working experience (YRXFT11C).

The categorical variables are sex (SEX21), mother tongue (MOTN2G15), region (REGRE25C) and immigrant status (IMMST15). SEX21, REGRE25C and IMMST15 are dichotomous, whereas MOTN2G15 has three levels (English, French and other). In this exercise we compare only Ontario and Quebec, so our classification variable is REGRE25C. In subsequent exercises, you can try SEX21, IMMST15 and MOTN2G15 yourself.

   /* Create a title for subsequent output */
   Title 'SLID: Working Experience 1994 (Ontario vs. Quebec)';

   /* Look at the contents of the sample */
      proc contents data=pontque position;
      run;

   /* Investigate the mean scores of quantitative variables */
      proc means data=pontque n min max mean std skew maxdec=3;
         var eage26c ttwgs28c yrxft11c yrsch18c;
      run;
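
Because REGRE25C is the classification variable for this exercise, it can be useful to see the same statistics computed separately for each province. A minimal sketch, assuming the PONTQUE dataset built above, adds a CLASS statement to the PROC MEANS step:

```sas
   /* Sketch: the same statistics, computed within each level of REGRE25C */
      proc means data=pontque n min max mean std skew maxdec=3;
         class regre25c;
         var eage26c ttwgs28c yrxft11c yrsch18c;
      run;
```

Unlike a BY statement, CLASS does not require the dataset to be sorted first.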

Now look at a chi-square analysis of immigrant status against self-reported mother tongue group, run separately for each province. Is there any difference between the two provinces?

   /* Chi-sq for immigrant status vs. mother tongue groups, by region */
      proc freq data=pontque;
      tables regre25c * immst15*motn2g15 / chisq nopercent;
      run;

Graphical summary statistics

PROC UNIVARIATE gives detailed information on each variable individually. The NORMAL option requests tests of normality, and the PLOT option produces three graphical outputs: a stem-and-leaf plot, a boxplot and a normal probability plot.

   /* Look at the univariate distribution of the numerical variables */
      proc univariate data=pontque normal plot;
         var eage26c ttwgs28c yrxft11c yrsch18c;
      run;

The DATACHK macro displays the quantitative variables side by side in boxplots so that you can compare their distributions.

   /* Check for data normality */
      %datachk(data=pontque, var=eage26c ttwgs28c yrxft11c yrsch18c, ls=90);
      run;

The SPLOT macro compares the quantitative variables in boxplots for the two levels in REGRE25C, namely Ontario and Quebec. Is there any difference between the two provinces?

   /* Compare the quantitative variables with REGRE25C */
      %splot(data=pontque, var=ttwgs28c yrxft11c yrsch18c, class=regre25c);
      run;
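
If the SPLOT macro is not installed, a similar display can be sketched with PROC SGPLOT. This is an alternative, not what the macro does internally, and it assumes SAS 9.2 or later, where the ODS statistical graphics procedures are available. One variable is shown; repeat the step for TTWGS28C and YRSCH18C.

```sas
   /* Sketch: side-by-side boxplots of working experience by province */
      proc sgplot data=pontque;
         vbox yrxft11c / category=regre25c;
      run;
```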

Transformation of variables

The SYMBOX macro draws a boxplot of the variable after transformation to each of the specified powers (by the usual ladder-of-powers convention, a power of 0 stands for the log transformation). Choose the power that gives the most symmetric boxplot.

   /* Find suitable powers for transformation */
      %symbox(data=pontque, var=ttwgs28c, powers=0 0.5 1 1.5 2);
      %symbox(data=pontque, var=yrxft11c, powers=0 0.5 1 1.5 2);
      run;

After choosing a power, create a new dataset that holds the transformed variables alongside the untransformed ones. Here the square root (power 0.5) is applied to both working experience and wages. We call the new dataset PONTQUE2. It has 12 variables: the original 10 plus the two transformed ones.

   /* Create new data set with variables transformed */
      data pontque2;
      set pontque;
      sqrtwex = sqrt(yrxft11c);
      sqrtwgs = sqrt(ttwgs28c);
      label sqrtwex = 'sqrt(Working experience 94)';
      label sqrtwgs = 'sqrt(Wages and salaries 94)';
      run;
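
To see whether the square root actually improved symmetry, the skewness of each variable can be compared before and after transformation. A small sketch, assuming the PONTQUE2 dataset just created; a skewness nearer zero and a mean closer to the median suggest a more symmetric distribution:

```sas
   /* Sketch: compare raw and square-root versions of each variable */
      proc means data=pontque2 n mean median std skew maxdec=3;
         var yrxft11c sqrtwex ttwgs28c sqrtwgs;
      run;
```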

Relations between quantitative variables, ignoring provinces

The LOWESS macro displays the relationship between two quantitative variables as a scatterplot with two fitted lines: the straight line is the linear regression line and the red curve is the lowess smooth. Note that we are using the original (untransformed) dataset, PONTQUE.

   /* Examine the relationship between 2 variables, note that original data set is used */
      %lowess(data=pontque, y=ttwgs28c, x=yrxft11c, hsym=0.5, interp=r1);
      %lowess(data=pontque, y=ttwgs28c, x=yrsch18c, hsym=0.5, interp=r1);
      %lowess(data=pontque, y=ttwgs28c, x=eage26c, hsym=0.5, interp=r1);
      run;
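
A comparable picture can be sketched without the macro by overlaying a regression line and a loess curve in PROC SGPLOT. This assumes SAS 9.2 or later, and the LOESS statement's fitting method and defaults are not necessarily the same as those of the LOWESS macro, so the curves may differ somewhat. Loess fits can also be slow on large samples.

```sas
   /* Sketch: wages against working experience, linear fit plus loess smooth */
      proc sgplot data=pontque;
         reg   x=yrxft11c y=ttwgs28c;              /* points and straight line */
         loess x=yrxft11c y=ttwgs28c / nomarkers;  /* smooth curve only */
      run;
```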

Fitting a model

We can start with a series of t-tests to see whether the two provinces differ significantly on each quantitative variable. Then we use PROC GLM to fit a more detailed model, which predicts SQRTWEX from region plus linear and quadratic terms in age, wages and schooling (the bar notation A|A expands to A and A*A). Note that we are now using the transformed dataset, PONTQUE2. What are your conclusions?

   /* Examine differences between two groups with dichotomous variables */
      proc ttest data=pontque2;
      class regre25c;
      var eage26c sqrtwgs sqrtwex yrsch18c;
      run;

   /* Fit a model */
      proc glm data=pontque2;
      class regre25c;
      model sqrtwex = eage26c|eage26c regre25c
               sqrtwgs|sqrtwgs
               yrsch18c|yrsch18c
               / solution;
      output out=pontque3 predicted=predict1 residual=resid1;
      run;
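
The OUTPUT statement saves the fitted values and residuals in PONTQUE3, so the fit can be checked afterwards. The sketch below is one possible follow-up, not part of pontque.sas: it plots residuals against predicted values and then reuses PROC UNIVARIATE, as earlier in this exercise, to look at the distribution of the residuals.

```sas
   /* Sketch: residuals vs. predicted values, with a reference line at zero */
      proc plot data=pontque3;
         plot resid1*predict1 / vref=0;
      run;

   /* Sketch: distribution of the residuals */
      proc univariate data=pontque3 normal plot;
         var resid1;
      run;
```

A fan shape or a curve in the first plot would suggest that a different transformation or additional terms are worth trying.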

Correlations

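PROC CORR prints the matrix of correlations among the quantitative variables. To obtain a separate matrix for each province, the dataset is first sorted by REGRE25C and then processed with a BY statement.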

   /* Look at correlation between quantitative variables for Ontario and Quebec separately */
      proc sort data=pontque2;
         by regre25c;
      run;
      proc corr data=pontque2;
         by regre25c;
         var eage26c sqrtwex sqrtwgs yrsch18c;
      run;

Other exercises

Now that you have compared Ontario and Quebec, you can try the other categorical variables: SEX21, IMMST15 and MOTN2G15. Note that you cannot run a t-test on MOTN2G15, because it has more than two levels, but the other procedures in this exercise work with a three-level variable.
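
For the three-level variable, a one-way analysis of variance is the usual replacement for the t-test. The following is a minimal sketch, assuming the PONTQUE2 dataset from above; the choice of Tukey's method for the pairwise comparisons is an illustration, not a recommendation from this exercise.

```sas
   /* Sketch: one-way ANOVA of sqrt(working experience) on mother tongue group */
      proc glm data=pontque2;
         class motn2g15;
         model sqrtwex = motn2g15;
         means motn2g15 / tukey;   /* pairwise comparisons of the 3 groups */
      run;
      quit;
```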