Tutorial 1 - Education Sample from the NLSY79


Introduction

A simplistic real-world study might examine the correlation between higher education and intelligence.  The National Longitudinal Survey of Youth - 1979 Cohort (NLSY79) provides an exceptional data source for this and similar types of analyses.  While education is a relatively simple variable to measure, intelligence is a more abstract idea.  A standardized test can serve as an approximation of intelligence: while not a perfect proxy, it provides a reliably measurable standard.  The NLSY79 provides scores and percentiles for the Armed Services Vocational Aptitude Battery (ASVAB), a standardized test that most high school aged youths in the United States have taken.  The tutorial below explores importing and manipulating NLSY79 data as well as seeking out correlations.

Importing the Data

NLSY79 data is available from the NLS Web Investigator in comma-separated value (CSV) format, which Draco can easily import.  The example file is available in example/tutorial1 distributed with Draco.  To import the data, start up Draco, and proceed to File->Import->Text->Cross-Sectional Data...

When presented with the Import Panel Data dialog, select Overwrite Current to place the imported data in the current Draco window.  The Import Data dialog should now appear.  Click Browse..., navigate to the example/tutorial1 directory, and select the original data.csv file, which contains the raw NLSY79 data.  Clicking the View... button will show the CSV file in a separate window.  Notice that the first row contains variable names.  Returning to the Import Data dialog, enter 1 in the Number of Header Lines to Skip text box.  Also, enter 1 in the Row Containing Headers text box.  Finally, make sure that only Commas is selected in the Column Separators section.  Press Import to load the data.  The contents of the CSV file should now appear in the Data window.
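
For readers who want to sanity-check the file outside Draco, the same import settings (one header row holding the variable names, comma separators) can be sketched in Python.  This is only an illustration: the sample rows are made up and are not real NLSY79 records, and the function name read_header_csv is hypothetical.

```python
import csv
import io

# Made-up sample in the same shape as the tutorial file: one header row
# of NLSY79 reference numbers, then comma-separated integer values.
SAMPLE = """R0214800,R0618200,R8496800
1,45,12
2,-4,16
2,80,-5
"""

def read_header_csv(text):
    """Return (names, rows): names come from row 1, data rows become ints."""
    reader = csv.reader(io.StringIO(text))
    names = next(reader)  # row 1 contains the headers
    rows = [[int(v) for v in row] for row in reader if row]
    return names, rows

names, rows = read_header_csv(SAMPLE)
print(names)
print(len(rows), "rows imported")
```

To try this on the real file, the text passed to the function would be read from original data.csv instead of the inline sample.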

Modifying the Data

There are a few things to notice about the recently imported data.  First, the variable names are all "R0214800" and other similar number/letter combinations, which aren't particularly descriptive.  Second, some variables contain negative numbers; negative values in the NLSY79 data set are codes for missing values and are of little interest in this example.

To begin making our data a little more friendly, start by renaming the columns.  Along with the CSV file, this tutorial provides a text file named variable labels.txt, which contains a description of each variable.  In Draco, open a new text editor by going to Edit->Open text editor...  A blank text editor should now appear in the main window.  Click on the rightmost Open icon on the toolbar, which allows the selection of a text file for editing in the text editor.  Navigate to and select variable labels.txt in the example/tutorial1 directory.  The first variable represents sex, the second the subject's percentile on the Armed Services Vocational Aptitude Battery (ASVAB), and the third the highest grade completed as of 2004.

To rename the variables, select one and proceed to Data->Column Operations->Rename Column... in the menus when the Data window is focused.  When presented with the dialog, delete the number/letter name and use a simpler designation:

Original    Simple
R0214800    sex
R0618200    asvab_percentile
R8496800    grade
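
The renaming above amounts to a simple lookup from reference number to descriptive name.  As a rough sketch outside Draco (the helper rename_columns is hypothetical and not part of Draco):

```python
# Mapping taken from the table above: NLSY79 reference number -> simple name.
RENAMES = {
    "R0214800": "sex",
    "R0618200": "asvab_percentile",
    "R8496800": "grade",
}

def rename_columns(names, renames):
    """Replace each known reference number; leave unknown names unchanged."""
    return [renames.get(name, name) for name in names]

print(rename_columns(["R0214800", "R0618200", "R8496800"], RENAMES))
```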

The grade and asvab_percentile variables still contain negative values that are of no interest in this analysis.  These can be eliminated using Draco's equation system.  Select the grade variable and select Data->Column Operations->Set Column Values... in the menu.  The equations dialog should now appear.  Eliminating negative values requires the following logic:

if grade is greater than or equal to 0 then
    use this grade value
otherwise
    grade is now blank

To achieve this using the Draco equation system, enter the following in the equations dialog:

if(ge(grade,0),grade)

Press set values. Once Draco performs all computations, all negative values in the grade column should have been replaced with blank values.  Follow the same procedure for the asvab_percentile variable.
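
The effect of this equation can be mimicked in a few lines of Python.  This is a sketch of the logic described above, not Draco's implementation: blank cells are represented here by None, the sample values are made up, and the helper name keep_if_nonnegative is hypothetical.

```python
def keep_if_nonnegative(value):
    """Mirror if(ge(x,0),x): keep the value when it is >= 0, else blank."""
    if value is None:  # cell is already blank
        return None
    return value if value >= 0 else None

grades = [12, -5, 16, 0, -4]  # made-up values; negatives are missing codes
cleaned = [keep_if_nonnegative(g) for g in grades]
print(cleaned)  # [12, None, 16, 0, None]
```

Note that a grade of 0 is kept, since the test is "greater than or equal to".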

Creating a New Variable

As an initial estimate, it might be interesting to see if there is any correlation between those who placed in a high percentile in the ASVAB and those who completed some post-high-school education.

A binary variable describing whether grades higher than 12 were completed should provide a reasonable estimate of whether someone attended post-secondary education.  To create the new variable, select Data->Column Operations->Add Column (or click the Add (+) button on the toolbar).  A new variable should now be present.  First rename the column to postsec.  The value of the new variable can be set simply using the following logic:

if grade is greater than 12 then
    set postsec to 1
otherwise
    set postsec to 0

The above logic can be achieved using the following equation:

if(gt(grade,12),1,0)

Using the above equation, the postsec variable should now contain only 1, 0, or blank.
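
A minimal Python sketch of the same rule follows.  It assumes, consistent with the statement above that postsec may be blank, that a blank grade produces a blank postsec; None stands in for a blank cell and the sample values are made up.

```python
def postsec(grade):
    """Mirror if(gt(grade,12),1,0), leaving blank grades blank."""
    if grade is None:  # assumption: blank input yields blank output
        return None
    return 1 if grade > 12 else 0

grades = [12, None, 16, 13, 8]  # made-up cleaned grade values
print([postsec(g) for g in grades])  # [0, None, 1, 1, 0]
```

A grade of exactly 12 maps to 0 because the comparison is strictly "greater than".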

A Logit Regression

For the initial estimate, a logit regression can be used to test the hypothesis that subjects placing in higher percentiles in the ASVAB testing are more likely to have completed some post-secondary education.  The new variable postsec was created specifically for this regression.  

Open the Logit regression window using Regress->Logit... when the Data window is in focus.  Once the window appears, the dependent and independent variables must be selected.  Select postsec for the dependent variable by clicking the appropriate check box.  Select both sex and asvab_percentile for independent variables.  The default options should be sufficient for this analysis.  Run the regression by clicking Perform Fit->Compute Regression... while the Logit window is in focus.  The Iterative Progress dialog should appear and display the convergence progress.  Once the progress bar reaches full width and the button changes from Stop to Close, the regression has converged.  Press the Close button, and the regression results should appear.  The results should look similar to the following:

Results of the Logit MLE Regression Model

Regression Variable: postsec

Sum Squared of the Residuals:   1.4143E03
Standard Error of the Fit:      0.43914
R-Squared Value:                0.58254
Adjusted R-Squared Value:       0.58243

Coefficient         Value       Std. Err.   t-Score
Constant            -1.20279    0.09161     -13.13018
sex                 -0.21163    0.05232     -4.04511
asvab_percentile    0.03519     0.00101     34.89564


The positive asvab_percentile coefficient (0.03519) indicates that a subject in a higher ASVAB percentile is more likely to have completed some post-secondary education.  An interesting note is that the negative sex coefficient (-0.21163) implies that women (sex=2) are less likely to attend post-secondary education than men with the same ASVAB percentile.
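
To make the coefficients more concrete, they can be turned into predicted probabilities.  The sketch below assumes the standard logistic form P = 1 / (1 + exp(-z)), with z the linear combination of the fitted coefficients, and assumes men are coded sex=1 (the tutorial only states that women are sex=2).  The function name prob_postsec is hypothetical.

```python
import math

# Fitted values copied from the logit results above.
B_CONST = -1.20279
B_SEX = -0.21163
B_ASVAB = 0.03519

def prob_postsec(sex, asvab_percentile):
    """Predicted probability of postsec = 1 under a standard logit model."""
    z = B_CONST + B_SEX * sex + B_ASVAB * asvab_percentile
    return 1.0 / (1.0 + math.exp(-z))

# Compare low and high percentiles for each sex code.
for sex in (1, 2):
    for pct in (10, 50, 90):
        print(sex, pct, round(prob_postsec(sex, pct), 3))
```

Within each sex code the predicted probability rises with the percentile, and at any given percentile it is lower for sex=2 than for sex=1, matching the signs of the coefficients.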

An Ordinary Least Squares Regression

A more general regression might be an ordinary least squares regression of grade versus sex and asvab_percentile.  Select the Data window again and click on Regress->Least Squares Fit... from the menu.  Select grade as the dependent variable and sex and asvab_percentile for independent variables.  Again, click Perform Fit->Compute Regression... while the Least Squares Fit window is focused.  A progress dialog will appear briefly while the regression is computed.  Finally, a new results window should appear:

Results of the Multiple Regression Model

Regression Variable: grade

Sum Squared of the Residuals:   2.9803E04
Standard Error of the Fit:      2.01587
R-Squared Value:                0.34627
Adjusted R-Squared Value:       0.34609

Coefficient                  Value       Std. Err.   t-Score
Constant                     10.81186    0.08268     130.77134
Coef 0 (sex)                 0.3052      0.04713     6.47516
Coef 1 (asvab_percentile)    0.05112     0.00082     62.10456


Again, the coefficient corresponding to asvab_percentile is positive, meaning subjects placing in higher percentiles tend to complete more schooling.  Another interesting result is that the positive sex coefficient implies that women (sex=2) complete more schooling on average than men.  This result does not contradict the earlier logit result: the logit models only whether any grade beyond 12 was completed, while this regression models the highest grade completed.  The difference raises new hypotheses that can be tested with this data set, but they are beyond the scope of this tutorial.
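
As a quick worked example, the fitted least squares coefficients reported above can be used to compute a predicted highest grade.  This is plain arithmetic on the reported values; the function name predicted_grade is hypothetical, and sex=1 for men is an assumption (the tutorial only states that women are sex=2).

```python
# Fitted values copied from the least squares results above.
CONSTANT = 10.81186
COEF_SEX = 0.3052
COEF_ASVAB = 0.05112

def predicted_grade(sex, asvab_percentile):
    """Predicted highest grade completed from the fitted linear model."""
    return CONSTANT + COEF_SEX * sex + COEF_ASVAB * asvab_percentile

print(round(predicted_grade(1, 50), 2))
print(round(predicted_grade(2, 50), 2))
```

At any fixed percentile, the two sex codes differ by exactly the sex coefficient, 0.3052 of a grade, and each additional ten percentile points adds about half a grade.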

Conclusion

This tutorial explored the basics of data manipulation and regression in Draco.  It showed that our intelligence proxy variable, ASVAB percentile, does positively correlate with the amount of education completed.  Other regression types behave similarly to the two used here.  All regression results can be saved separately as HTML files for inclusion in other documents.  A final version of this tutorial's worksheet is located in the example/tutorial1 directory.
Copyright © 2008 Approximatrix, LLC
Text licensed under the Creative Commons Attribution-Share Alike 3.0 License
Approximatrix, LLC makes no claims of copyright on any NLS data
Draco™ and the Approximatrix logo are trademarks of Approximatrix, LLC
Other trademarks are property of their respective owners