A simplistic real-world study may focus on the correlation between higher education and intelligence. The National Longitudinal Survey of Youth - 1979 Cohort
(NLSY79) provides an exceptional data source for this and similar types
of analyses. While education is a relatively simple variable to
measure, intelligence represents a more abstract idea. As an
approximation of intelligence, a standardized test can be used.
While not a perfect proxy, a standardized test provides a
reliably measurable standard. The NLSY79 provides scores and percentiles
from the Armed Services Vocational Aptitude Battery (ASVAB), a
standardized test that most high-school-aged youths in the United
States have taken. The tutorial below explores importing and
manipulating NLSY79 data as well as seeking out correlations.
Importing the Data
NLSY79 data is available from the NLS
Web Investigator in comma-separated value (CSV) format, which
Draco can easily import. The example file is available in example/tutorial1
distributed with Draco. To import the data, start up Draco,
and proceed to File->Import->Text->Cross-Sectional
Data...
When presented with the Import
Panel Data dialog, select Overwrite Current to
place the imported data in the current Draco window. The Import Data dialog
should now appear. Click Browse...
and navigate to the example/tutorial1
directory and select the original
data.csv file which contains the raw NLSY79 data.
Clicking the View...
button will show the CSV file in a separate window. Notice
that
the first row contains variable names. Returning to the Import Data dialog,
enter 1 in
the Number of Header
Lines to Skip text box. Also, enter 1 in the Row Containing Headers
text box. Finally, make sure that only Commas are selected
in the Column
Separators Section. Press Import to load the
data. The contents of the CSV file should now appear in the Data window.
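For readers who want to inspect the file outside Draco, the same header handling can be sketched in Python. This is only an illustration and not part of Draco: the sample rows below are made up (the real values live in the tutorial's CSV file), the helper name read_rows is hypothetical, and the sketch assumes the first row holds the variable names, as noted above.

```python
import csv
import io

# Made-up sample in the same shape as the NLSY79 export: the first row
# holds variable names, later rows hold values (negative = missing).
SAMPLE = """R0214800,R0618200,R8496800
1,45,12
2,-4,16
2,88,-5
"""

def read_rows(text):
    """Parse CSV text, using the first row as column names."""
    reader = csv.DictReader(io.StringIO(text))
    # Convert every field to int so negative codes can be tested later.
    return [{name: int(value) for name, value in row.items()} for row in reader]

rows = read_rows(SAMPLE)
print(rows[0])
```

To read a real file instead, the same csv.DictReader call can be given an open file handle in place of the in-memory sample.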
Modifying the Data
There are a few things to notice about the recently imported data.
First, the variable names are all "R0214800" and other
similar number/letter combinations, which aren't particularly
descriptive. Second, some variables contain negative numbers;
negative values in the NLSY79 data set are used to flag missing
values and are of little interest in this example.
To begin making our data a little more friendly, start by renaming the
columns. Along with the CSV file, this tutorial provides a
text file named variable
labels.txt. This file contains descriptions of
each variable. In Draco, open a new text editor by going to Edit->Open text editor....
A blank text editor should now appear in the main window.
Click on the rightmost Open
icon on the toolbar, which allows the selection of a text file for
editing in the text editor. Navigate to and select variable labels.txt
in the example/tutorial1
directory. The file describes each of the variables.
The first variable represents sex, the second the subject's
percentile on the Armed Services Vocational Aptitude Battery (ASVAB),
and the third the highest grade completed as of 2004.
To rename the variables, select one and proceed to Data->Column
Operations->Rename Column... in the menus when the
Data window is focused. When presented with the dialog,
delete the number/letter name and use a simpler designation:
Original    Simple
R0214800    sex
R0618200    asvab_percentile
R8496800    grade
The grade
and asvab_percentile
variables still contain negative values that are of no interest in this
analysis. These can be eliminated using Draco's equation
system. Select the grade
variable and select Data->Column
Operations->Set Column Values... in the menu.
The equations dialog should now appear. Eliminating
negative values requires the following logic:
if grade is greater than or equal to 0 then
    use this grade value
otherwise
    grade is now blank
To achieve this using the Draco equation system, enter the following in
the equations dialog:
if(ge(grade,0),grade)
Press set values. Once Draco performs all computations, all negative
values in the grade
column should have been replaced with blank values. Follow
the same procedure for the asvab_percentile
variable.
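The effect of the equation above can be sketched in Python for comparison. This is an illustration only, not Draco's implementation: the function name blank_negative is hypothetical, and None is used here as a stand-in for a blank Draco cell.

```python
def blank_negative(value):
    """Mirror if(ge(grade,0),grade): keep values >= 0, blank the rest.

    None stands in for a blank Draco cell (an assumption of this sketch).
    """
    if value is None:
        return None
    return value if value >= 0 else None

# Made-up grade values, including two negative missing-value codes.
grades = [12, -5, 16, 0, -4]
cleaned = [blank_negative(g) for g in grades]
print(cleaned)  # [12, None, 16, 0, None]
```

Note that 0 is kept, because the condition is "greater than or equal to".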
Creating a New Variable
As an initial estimate, it might be
interesting to see if there is any correlation between those who placed
in a high percentile in the ASVAB and those who completed some
post-high-school education.
A binary variable describing whether grades higher than 12 were
completed should provide a reasonable estimate of whether someone
attended post-secondary education. To create the new
variable, select Data->Column
Operations->Add Column (or click the Add (+) button on the
toolbar). A new variable should now be present.
First rename the column to postsec.
The value of the new variable can be set simply using the
following logic:
if grade is greater than 12 then
    set postsec to 1
otherwise
    set postsec to 0
The above logic can be achieved using the following equation:
if(gt(grade,12),1,0)
Using the above equation, the postsec
variable should now contain only 1, 0, or blank.
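The same rule can be sketched in Python. As before, this is an illustration rather than Draco's own code: postsec_flag is a hypothetical name, None stands in for a blank cell, and the sketch assumes that a blank grade yields a blank postsec, which is consistent with the 1, 0, or blank outcome described above.

```python
def postsec_flag(grade):
    """Mirror if(gt(grade,12),1,0), leaving blank (None) grades blank."""
    if grade is None:
        return None
    return 1 if grade > 12 else 0

# Made-up grade values; None marks a grade that was blanked earlier.
grades = [12, None, 16, 0, 13]
postsec = [postsec_flag(g) for g in grades]
print(postsec)  # [0, None, 1, 0, 1]
```

A grade of exactly 12 maps to 0, since the comparison is strictly "greater than".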
A Logit Regression
For the initial estimate, a logit regression can be used to test the
hypothesis that subjects placing in higher percentiles in the ASVAB
testing are more likely to have completed some post-secondary
education. The new variable postsec was
created specifically for this regression.
Open the Logit regression window using Regress->Logit...
when the Data window is in focus. Once the window appears,
the dependent and independent variables must be selected.
Select postsec
for the dependent variable by clicking the appropriate
check box. Select both sex
and asvab_percentile
for independent variables. The default options should be
sufficient for this analysis. Run the regression by clicking Perform Fit->Compute
Regression... while the Logit window is in focus.
The Iterative Progress dialog should appear and display the
convergence progress. Once the progress bar reaches full
width and the button changes from Stop
to Close,
the regression has converged. Press the Close button, and
the regression results should appear. The results should
appear similar to the following:
Results of the Logit MLE Regression Model
Regression Variable:             postsec
Sum Squared of the Residuals:    1.4143E03
Standard Error of the Fit:       0.43914
R-Squared Value:                 0.58254
Adjusted R-Squared Value:        0.58243

Coefficient         Value       Std. Err.   t-Score
Constant            -1.20279    0.09161     -13.13018
sex                 -0.21163    0.05232     -4.04511
asvab_percentile    0.03519     0.00101     34.89564
The results show that a subject with a higher ASVAB percentile is more
likely to attend post-secondary education: the asvab_percentile
coefficient, 0.03519, is greater than zero. An interesting note
is that, at a given percentile, women (sex=2) are less likely to attend
post-secondary education, since the sex coefficient is negative (-0.21163).
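To make the logit coefficients more concrete, they can be turned into predicted probabilities. The sketch below assumes that Draco's logit model uses the standard logistic link, p = 1 / (1 + exp(-z)), and that men are coded sex=1 (the tutorial only states that women are sex=2). The function name postsec_probability is hypothetical.

```python
import math

# Coefficients copied from the logit results above.
CONSTANT = -1.20279
B_SEX = -0.21163
B_ASVAB = 0.03519

def postsec_probability(sex, asvab_percentile):
    """Predicted probability under a standard logistic link (assumed)."""
    z = CONSTANT + B_SEX * sex + B_ASVAB * asvab_percentile
    return 1.0 / (1.0 + math.exp(-z))

# Compare men (assumed sex=1) and women (sex=2) at three percentiles.
for percentile in (10, 50, 90):
    p_men = postsec_probability(1, percentile)
    p_women = postsec_probability(2, percentile)
    print(f"percentile {percentile}: men {p_men:.3f}, women {p_women:.3f}")
```

Under these assumptions, the predicted probability rises with the percentile for both groups, and at any fixed percentile it is lower for sex=2 than for sex=1, matching the signs of the two coefficients.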
An Ordinary Least Squares Regression
A more general regression might be an ordinary least squares regression
of grade
versus sex
and asvab_percentile.
Select the Data window again and click on Regress->Least Squares
Fit... from the menu. Select grade as the
dependent variable and sex
and asvab_percentile
for independent variables. Again, click Perform Fit->Compute
Regression... while the Least Squares Fit window is
focused. A progress dialog will appear briefly while the
regression is computed. Finally, a new results window should
appear:
Results of the Multiple Regression Model
Regression Variable:             grade
Sum Squared of the Residuals:    2.9803E04
Standard Error of the Fit:       2.01587
R-Squared Value:                 0.34627
Adjusted R-Squared Value:        0.34609

Coefficient                  Value       Std. Err.   t-Score
Constant                     10.81186    0.08268     130.77134
Coef 0 (sex)                 0.3052      0.04713     6.47516
Coef 1 (asvab_percentile)    0.05112     0.00082     62.10456
Again, the coefficient corresponding to asvab_percentile is
positive, meaning subjects placing in higher percentiles tend to complete
more schooling. Another interesting result is that the sex coefficient
is positive, so women (sex=2) tend to complete more schooling than men
at a given percentile. This result does
not contradict the earlier logit result; instead, it may suggest that
either:
    women who attend post-secondary education attend much more than men, or
    men are more likely to drop out prior to completing high school.
These are new hypotheses that can now be tested with this data set, but are beyond the scope of this tutorial.
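As a quick worked example of reading the least squares coefficients, the fitted equation can be evaluated directly. The sketch below simply plugs the reported values into the linear model grade = constant + b_sex * sex + b_asvab * asvab_percentile. The function name predicted_grade is hypothetical, and men are assumed to be coded sex=1.

```python
# Coefficients copied from the least squares results above.
CONSTANT = 10.81186
B_SEX = 0.3052
B_ASVAB = 0.05112

def predicted_grade(sex, asvab_percentile):
    """Fitted highest grade completed from the linear model."""
    return CONSTANT + B_SEX * sex + B_ASVAB * asvab_percentile

# Holding sex fixed, moving from the 25th to the 75th percentile
# changes the fitted value by 50 * B_ASVAB.
gain = predicted_grade(1, 75) - predicted_grade(1, 25)
print(round(gain, 3))  # 2.556
```

In other words, under this fit a 50-point difference in ASVAB percentile corresponds to roughly two and a half additional grades completed, while the difference between sex=2 and sex=1 at the same percentile is about 0.3 of a grade.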
Conclusion
In this tutorial, the basics of data manipulation
and regression in Draco have been explored. This simple tutorial
showed that our intelligence proxy variable, ASVAB percentile, does
positively correlate with amount of education completed. Other
regression types behave similarly to the two used here. All
regression results can be saved separately as HTML files for inclusion
in other documents. A final version of this tutorial's worksheet
is located in the example/tutorial1 directory.