Course Description:
Introduction (through both lecture and supervised work, integrated in a practicum format) to elementary use and overview of SAS Version 8.2 for Windows, including data file organization, data management, data import and export (from/to other formats and operating systems), and basic analysis. Use of SAS on other platforms supported by ITC (Mac, Unix) will be addressed but not explicitly instructed.
This document is the first part of the Introduction to SAS workshop; the second part is also available online.
Prerequisites
Familiarity with DOS (file paths and directory structures) and Microsoft Windows (booting, menus, mouse, scrolling, saving, etc.).
Table of Contents
- Workshop Basics
- Workshop Structure
- License Warning
- Program Overview
- Syntax Conventions
- Obtaining the Files Used in this Tutorial
- Getting Started
- Starting SAS for Windows
- Managing a Dataset
- Making a Dataset Active
- Saving Commands
- Examining Log and Output
- Data Preparation
- Running Frequencies
- Missing Values
- Variable Labels
- Data Analysis
- Examining Differences
- Recoding Variables
- Saving your Data
- Generating Cross tabulations
- Documentation and Help
- SAS Tutorial
- SAS Manuals
- Sample Syntax
- Web Documentation
- Consulting Services
Workshop Basics
Workshop Structure
The goals of this document (and workshop) are to provide a brief introduction to SAS, explain a few fundamental commands, and practice some of the many features of SAS for Windows.
In this first session, we will focus on basic data management procedures, including how to:
- open and manage a data file,
- define data (resolve missing values, and create variable labels and value labels),
- run basic descriptive statistics (frequencies and descriptives),
- transform data (recoding, and computing new variables), and
- perform basic analytical procedures (cross-tabulations and regression analysis).
In the next session, we will look at additional procedures you may need in order to be productive in a SAS environment, including how to:
- read other data formats (e.g. raw ASCII data, Excel files, SPSS files),
- save the current dataset as a SAS permanent dataset, and
- use several advanced procedures, including IF-THEN statements, merging, and macros.
Both sessions will concentrate on introductory manipulation of SAS for Windows; they will touch on interpretation of output and invocation of particular procedures. But they are not a complete beginner's guide. The SAS for Windows Tutorial is highly recommended for such issues, and covers topics beyond the scope of these documents.
License Warning
SAS for Windows is a product of SAS Institute Inc. Information Technology and Communication (ITC) has a site license that permits ITC to distribute SAS to faculty, staff, and students for use in Charlottesville only. In addition, ITC agrees to provide user support: if you have a question regarding the SAS software, call the ITC Research Computing Support Center at 243-8800. Unauthorized copying and use of SAS software violates the copyright and site license and will result in ITC losing its site license. Please use SAS legally.
ITC provides access to a number of other general purpose statistical packages for faculty, staff, and graduate students. For a listing of available software and associated information, please see our Researchers website.
Program Overview
Variation: The use of SAS can vary from simple to complex, depending on the condition of the data and sophistication of the analysis. SAS programs may include a variety of activities: data input, data transformation, elementary or advanced statistical analysis, creation of data sets, or custom output.
Routine: Any data analysis (whether with SAS or another program) has five steps:
- prepare data,
- prepare commands,
- invoke commands,
- examine output, and
- save your work.
In practice, you will typically repeat several of these steps, particularly as you correct errors and re-invoke commands.
Statements: SAS programs consist of SAS statements, largely contained in DATA and PROC steps. DATA steps introduce and prepare data for use in SAS. PROC steps issue procedures, to actually perform analysis and generate desired output. Each PROC step calls a program to process (e.g. list, sort, compute, plot, and/or print) the information stored in a data set using keywords such as PROC FREQ for frequencies, PROC MEANS for descriptive statistics, and PROC REG for regression.
Order: A typical SAS run will encompass both DATA steps and PROC steps. You can use several DATA steps to create multiple data sets, and these in turn can be processed by several PROC steps. These steps may occur in any order, though a DATA step must precede a PROC step using that data.
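As a minimal sketch of this ordering, a DATA step can create a temporary data set that a PROC step then processes. The data set name demo, its variables, and its three rows are made up for illustration and are not part of the workshop files.

```sas
* Hypothetical example - demo and its values are illustrative only ;
DATA demo ;
  INPUT id score ;
  CARDS ;
1 10
2 20
3 30
;
RUN ;

* The PROC step comes after the DATA step that creates demo ;
PROC MEANS DATA = demo ;
  VAR score ;
RUN ;
```

Reversing the two steps would fail, because demo would not yet exist when PROC MEANS tried to read it.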
Errors: By default, SAS does not write error messages and warnings to your terminal as it executes. When SAS completes execution, it writes a summary of the errors encountered to the LOG file. You should always begin examining the results of your SAS program by checking the LOG file for ERRORS and NOTES.
Etc: SAS has many capabilities that this course does not have time to cover. More than just a statistical analysis tool, it is capable of complex report generation, database management, and graphics. For further details, see the SAS Language Usage and SAS Language Reference manuals (available at the Research Computing Support Center in 244 Wilson).
Syntax Conventions
Keywords begin most statements and are recognized as commands (e.g. DATA or PROC). They are reserved as commands and should not be part of file or variable names. (More on names later.)
Case sensitivity: SAS does not care if commands are typed in UPPERCASE or lowercase. (Unix users should note, however, that the RS/6000 AIX operating system is case-sensitive.)
Spaces: Commas, spaces, and the "equals" sign are NOT used interchangeably in SAS. When items are separated by spaces, the number of spaces is not important, since "extra" spaces are ignored.
Lines: Every statement must end with a semi-colon, although a statement may be spread over several lines and can begin anywhere on a line. All statements must end by column 80, although you do not have to use all 80 columns. It is also possible (but, for debugging purposes, not recommended) to have multiple statements on a single line.
Errors: Common errors include omitting a semi-colon at the end of a statement, and omitting a period in a format modifier (which we'll discuss in the second session).
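To illustrate these conventions, the sketch below writes the same step twice, assuming the sasclass library and bankdata data set that are set up later in this tutorial. SAS should treat both versions identically, since case, extra spaces, and line breaks within a statement are not significant.

```sas
* Conventional layout with one statement per line ;
PROC PRINT DATA = sasclass.bankdata ;
RUN ;

* Same step in lowercase with extra spaces and spread over several lines ;
proc   print
    data = sasclass.bankdata
  ;
run ;
```

The first layout is easier to debug, because a missing semi-colon is much easier to spot when each statement sits on its own line.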
Obtaining the Files Used in this Tutorial
There are several files that have been created for use in this tutorial. The tutorial assumes that the files are saved on the hard drive of your PC in an area named C:/Temp (of course you may choose to save these files elsewhere). If you have problems downloading or using the files using Netscape, using Internet Explorer may resolve the problem.
To save the following files to the C:/Temp area of your hard drive, right-click on the link corresponding to each file and choose the Save As... option from the pull-down menu. When the Save As... dialog box opens, use the navigation tools to designate the C:/Temp area as the "save in" area. Then click the Save button to save the file in the C:/Temp area.
Using the above method, save the following files in your C:/Temp directory.
- bank.dat - The ASCII file containing the raw bank data
- bankdata.sas7bdat - The SAS dataset (for Part 1)
- bankend1.sas7bdat - The SAS dataset (for Part 2)
- course0.sas - A file containing the SAS commands to read bank.dat
- course1.sas - A file containing the SAS commands for Part 1 of the tutorial
- course2.sas - A file containing the SAS commands for Part 2 of the tutorial
- bank.xls - An Excel file containing the bank data
- sasdata.dat - A raw data file
Getting Started
Starting SAS for Windows
To start SAS, click on each of these five selections, in this order:
- the START button on the screen.
- the PROGRAMS listing.
- the STATISTICAL listing.
- the SAS folder listing.
- the SAS 8 listing (or listing for other version, as instructed).
SAS for Windows consists of five sub-windows. By default, the EXPLORER, ENHANCED EDITOR, and LOG windows are initially open, while the OUTPUT and RESULTS windows are hidden under these three.
You can position and resize these windows any way you wish. To bring a partially hidden window to the front, click once anywhere on it. To find a window that is not showing at all, select the item on the SAS window bar. You can also right-click anywhere in the LOG or OUTPUT windows, choose VIEW from the pop-up context menu, and select any window from the list at the top.
The Enhanced Editor window provides a number of useful editing features, including color coding and syntax checking of SAS language.
The Results window helps you navigate and manage output from SAS programs that you submit. You can view, save, and print individual items of output. By default, the Results window is positioned behind the Explorer window and it is empty until you submit a SAS program that creates output. Then it moves to the front of your display.
In the Explorer window, you can view and manage your SAS files, and create shortcuts to non-SAS files. Use this window to create new libraries and SAS files, to open any SAS file, and to perform most file management tasks such as moving, copying, and deleting files.
To create a new library, sasclass, with directory C:/Temp, click the Explorer window, then select FILE, NEW in the pull-down menu. In the New Library dialog box that appears, type "sasclass" next to Name and "C:/Temp" next to Path, as shown here:
Now click OK. A new library called sasclass will show up in the Explorer window.
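The same library can also be assigned with a LIBNAME statement submitted from the editor, which is convenient because the assignment can then be saved in a command file along with the rest of your program. A minimal sketch, assuming the tutorial files were saved in C:/Temp as described above:

```sas
* Assign the libref sasclass to the C:/Temp directory ;
LIBNAME sasclass 'C:/Temp' ;
```

After this statement runs, the LOG should contain a NOTE confirming the assignment, and data sets in that directory can be referred to as sasclass.membername.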
Note that SAS uses a single icon on the start bar, whereas SPSS, for example, has a separate icon for each syntax, output, and data window.
In this session, we will enter commands by typing directly into the ENHANCED EDITOR window and run commands directly from this window. (Alternately, you could create such a file in any text editor or word processor. Note, however, that you must save the data or command file as an ASCII text or DOS file, not in the format of the word processor you are using.) You may submit a single command, an entire command file, or merely part of the file; we will try each in this session.
First, let's look at some data.
Managing a Dataset
In many statistical programs, data is typically created and displayed in a conventional spreadsheet format (a grid of rows and columns, where each row is a case and each column is a variable). SAS does not typically display data in this format, but there is a VIEWTABLE option (effective with version 6.12) that will let you look at data in this way.
Opening a SAS data file using the VIEWTABLE option
Step 1: From the drop-down menus, choose TOOLS, TABLE EDITOR, as shown here:
Step 2: To view a data set, you will need to get to the Open File dialog box. Select FILE and OPEN as shown here:
Step 3: You will next need to select the location of the dataset which you would like to manage. Select the library where the file is. Next, select the bankdata SAS dataset. (Note that only SAS datasets are listed.) In the image above, the file bankdata.sas7bdat is listed as BANKDATA. Once you've navigated to the correct library and see the file you want listed, click once on that filename to highlight it.
Step 4: Click OPEN, and the VIEWTABLE window should open as follows:
Step 5: There are two modes in SAS for looking at the table. The "browse" mode only allows you to look at the data. So that we can also manipulate the data set in VIEWTABLE, choose EDIT, Edit Mode, as shown here:
If you wanted or needed to, you could edit the actual data in this window -- changing a particular value or values, for instance. You can also investigate and manipulate labels given to particular variables.
Labeling a Variable
To associate a label with a variable (within VIEWTABLE) follow these steps:
Step 1: Click at the top of the third column to highlight that column's variable name.
Step 2: Double-click that variable name (bdate) to see the SAS variable dialog box.
Step 3: Type "Birth date of the respondent" in the white box marked "Label"
Step 4: Click on APPLY then CLOSE.
Then close VIEWTABLE by selecting FILE > CLOSE.
Using VIEWTABLE is fine to look at or edit data, but won't make the data file active (i.e. available for you to perform statistical analysis on it).
Making a Dataset Active
To create a SAS dataset, data may be included directly in the SAS command file (using the cards syntax) but is more often read into SAS from a separate data file. Frequently, this data will be in some non-SAS format, such as a columnar ASCII (text or DOS) file or an Excel file. In the second session, we will discuss how to create and import these other formats. We'll begin today with the bankdata dataset, which is already in SAS format.
Below we demonstrate the use of a series of SAS commands to operate on the bankdata data set.
Step 1: Clear the data from the ENHANCED EDITOR window by clicking in the ENHANCED EDITOR window to make it the active window (or by selecting it from the Windows menu), then select Edit > Clear All from the pop-up menu, as shown here:
Step 2: Type the following statements in the ENHANCED EDITOR window. (Remember that you are only setting up the commands here, which is not the same as invoking those commands.)
PROC CONTENTS DATA = sasclass.bankdata ;
PROC PRINT DATA = sasclass.bankdata ;
RUN ;
What these commands do:
1) The first line runs the procedure "contents" to view the contents of the bankdata SAS data set (which resides in the library area called sasclass).
2) The second line prints the data contained in the bankdata data set.
3) The last line submits the previous two lines to the SAS system for processing.
Saving Commands
You have just written your first SAS program! Now, save it with the following steps:
Step 1: Choose FILE, SAVE AS from the menu bar
Step 2: Click the down arrow, scroll up, and choose C:
Step 3: Double-click on the Temp directory icon
Step 4: Type practice.sas in the filename box, and
Step 5: Click SAVE or press Enter.
Examining Log and Output
Running a set of commands at a time (rather than issuing commands individually, or running an entire analysis file at once) is referred to as non-interactive use of SAS. Running jobs non-interactively allows you to easily find and correct mistakes. So let's check and see if you made any.
Choose VIEW then OUTPUT from the menu bar (to select the OUTPUT window), and you'll see.... an empty window, because you haven't yet submitted the commands. As mentioned earlier, typing commands into the ENHANCED EDITOR only creates a command file. Invoking SAS to actually "run" commands is a separate procedure:
Step 1: Select the ENHANCED EDITOR window. When the menu bar for the ENHANCED EDITOR window is dark blue rather than gray, this indicates that the window is selected.
Step 2: To submit the commands, choose SUBMIT under RUN in the menu bar. Alternatively, right-click and choose SUBMIT ALL.
By default, SAS writes output into the OUTPUT window, but writes error messages and warnings to the LOG window. Whether there even is any output depends on whether there are errors in the LOG window. You should always examine your LOG window before the OUTPUT window.
Under VIEW in the menu bar, choose LOG to view the LOG window. Look closely for ERROR messages and at the descriptive NOTES.
In this case, there are no errors listed: Each of the procedures is followed by a NOTE indicating that the procedure was completed within a fraction of a second -- or a number of seconds, with a larger dataset. (If there were errors, and you needed help debugging your program or interpreting the output, you could save and print the LOG file and bring it to a statistical consultant for assistance.)
Since the LOG window shows no errors, next choose VIEW then OUTPUT from the menu bar. The output window contains the results of the two procedures. First, here is the output from PROC CONTENTS:
Here's part of the PROC PRINT output. (The full output is 19 screens full, showing all 474 cases.)
Note that observation number 6 has no value for GENDER, and observations 9 and 28 have a value of X for JOBCAT. You will almost always have "missing values", whether you start with an already-collected dataset or collect the data yourself, and you will need to resolve them before using the data.
Data Preparation
Regardless of where your data came from, or who collected it, you will almost always need to make some modifications before you begin your analysis. In particular, you may need to resolve "missing values" as well as provide meaningful labels for each variable and for each value of each variable. Before we do that, we'll extend our simple SAS command file to look deeper at what's there already.
Running Frequencies
Prior to modifying any dataset, you should consider what it looks like already. One way of doing that would be to scan the PROC PRINT output and visually assess the data, as we did briefly above. That might be fine if we only had those 30 cases, but not for larger datasets. Much easier, even with only 474 cases, is to let SAS do the work for you. The commands below produce several useful tables.
PROC FREQ DATA = sasclass.bankdata ;
  TABLES gender ;
PROC MEANS DATA = sasclass.bankdata ;
  VAR salary ;
Step 1: Type these commands at the bottom of what already appears in the Editor,
Step 2: highlight these statements,
Step 3: Then, in order to execute them in SAS, choose RUN and SUBMIT, as you did above.
Always remember to check the LOG file before the OUTPUT window. Do that now and you'll see ... nothing new, because here you submitted the command line but not the requisite RUN statement.
Step 4: Add:
RUN ;
to the command file, then highlight and submit this command segment. The LOG should now show:
Note that you will typically use PROC FREQ for categorical variables (nominal or ordinal) and PROC MEANS for continuous (interval or ratio) data -- in this case, gender and salary, respectively.
If you look at the OUTPUT window and scroll up a bit, you should see this PROC FREQ output, which shows that our dataset includes 215 women, 255 men, and 4 respondents of unknown gender:
On the next page, the PROC MEANS output shows that salaries for this sample range from $15,750 to $135,000, with a mean under $35K and a standard deviation above $17K.
From this we can estimate that two-thirds of the individuals sampled earn between $18K and $52K (within one standard deviation of the mean) and less than 5% earn over $68K (two standard deviations above the mean). We can also see that, unlike the data for gender, there is valid data on the salary of all 474 respondents.
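The bounds quoted above can also be computed by SAS rather than by hand. The following sketch saves the mean and standard deviation to a one-row temporary data set and writes the bounds to the LOG; the names salstats, mean_sal, sd_sal, low1, high1, and high2 are illustrative choices, not part of the workshop files. Keep in mind that the two-thirds and 5% rules of thumb assume a roughly normal distribution.

```sas
* Store the mean and standard deviation of salary in a one-row data set ;
* The names salstats, mean_sal and sd_sal are illustrative ;
PROC MEANS DATA = sasclass.bankdata NOPRINT ;
  VAR salary ;
  OUTPUT OUT = salstats MEAN = mean_sal STD = sd_sal ;
RUN ;

* Write the one- and two-standard-deviation bounds to the LOG ;
DATA _NULL_ ;
  SET salstats ;
  low1  = mean_sal - sd_sal ;
  high1 = mean_sal + sd_sal ;
  high2 = mean_sal + 2 * sd_sal ;
  PUT low1= high1= high2= ;
RUN ;
```

Because the DATA _NULL_ step creates no data set, its only result is the line that PUT writes to the LOG window.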
Missing Values
System missing values are literally missing -- there is simply no data for some cases. Any variable for which a valid value cannot be computed or read from raw data is assigned the SAS system-missing value. In the dataset we're working with, there are four cases (including observation 6) for which no value is recorded for GENDER.
There is usually no procedure necessary to address these values (such as "setting blanks to zero"), because SAS automatically accepts the absence of values (or a single period surrounded by spaces) as system-missing data. However, as we will discuss in the second session, with list-style input, you must use periods (not blanks) to indicate missing values.
User missing values are non-blank values for numeric variables which you elect to ignore for purposes of analysis, and specifically designate as "missing". They typically indicate non-acceptable responses or otherwise differentiate among cases.
In some programs (e.g. SPSS), perhaps 0 indicates respondents who refused to answer a survey question, 97 indicates those for whom the question was inappropriate or inapplicable, and 98 indicates a response of "don't know". In SAS, these "user-defined missings" may be any of the 26 CAPITAL or lowercase letters of the alphabet (numeric values are not allowed) or the underscore. In our dataset, there are seven cases (including observations 9 and 28) which have a value of X for JOBCAT.
You may sometimes wish to analyze the distribution of missing cases, or to explain missing data with other data. For example, perhaps retirees are less likely to answer questions about sexual habits. But you will usually exclude both missing values, and the cases associated with them, from analysis.
The MISSING statement is used within a DATA step to declare user-missing values for numeric variables. User-missing values cannot be defined for character variables; in list-style input, a missing character value must be represented by a period.
DATA sasclass.missingdata;
MISSING X ;
input var1 charvar$;
cards;
4 aaa
5 bbb
X ccc
3 .
. eee
;
RUN ;
These statements tell SAS to treat X as a missing value for var1, and thus to ignore those cases with X reported as var1 in any analysis. Here is how you would refer to missing values in SAS statements:
if var1 = .X then do;
if var1 = . then do;
if charvar = ' ' then do;
Note that the first if statement only picks up the third observation, whereas the second one picks up the fifth observation, although both are treated as missing in statistical analyses. A frequency table for the variable JOBCAT would show:
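A table of that kind can be requested with the MISSING option on the TABLES statement, which asks PROC FREQ to list missing values as categories of their own. The sketch below also shows one way to catch every kind of numeric missing value in a single test, relying on the fact that all missing values sort at or below .Z. It assumes the sasclass.missingdata data set created above.

```sas
* List missing values of jobcat as their own rows in the table ;
PROC FREQ DATA = sasclass.bankdata ;
  TABLES jobcat / MISSING ;
RUN ;

* Flag any numeric missing value - system or user - in one test ;
DATA _NULL_ ;
  SET sasclass.missingdata ;
  IF var1 <= .Z THEN PUT 'Missing var1 at observation ' _N_ ;
RUN ;
```

With the five observations entered above, the second step should report observations 3 and 5 in the LOG.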
Variable Labels
So far, we only have variable names (short names for what the variable is) and numeric values for the variables. The output of your procedures will be easier (for you and others) to understand when you instead use variable labels and value labels. Value labels are more difficult in SAS than in some other programs, such as SPSS, and will be addressed in the second session. But we're ready now to apply variable labels to this data set.
We already implemented one variable label while in VIEWTABLE, when we labeled BDATE as "Birth date of the respondent". A command approach can do the same thing for many variables in one step. The LABEL statement, used in a DATA step (or, as below, in PROC DATASETS), provides labels for variables. Although variable names are limited to 32 characters in Version 8 (eight characters in Version 6), the label may be up to 256 characters long, including blanks. For example:
LABEL ses = 'Computed Socio-Economic Status' ;
We'll be using five variables in our data set, so should label at least those five.
Step 1: Add these lines to the bottom of your command file in the EDITOR:
PROC DATASETS library = sasclass ;
  Modify bankdata ;
  LABEL gender  = "Respondent's Sex"
        educ    = 'Number of years of education'
        preexp  = 'Previous experience (months)'
        bdate   = 'Birth year of the Respondent'
        jobtime = 'Months since hired'
        salary  = "Respondent's salary" ;
RUN ;
quit;
Note that an apostrophe can be used in a label if the label is surrounded by double-quotations, as was done above for gender and salary; otherwise, single-quotes are fine.
The QUIT statement above causes SAS to stop the procedure that is running. DATASETS is an interactive procedure: after a RUN statement it stays active, so you can submit further statements and SAS will process them without starting the procedure over. Since we are done with this procedure, we quit it immediately after running our commands.
Step 2: Next, add these lines so that you can see the effect of having specified variable labels:
PROC FREQ DATA = sasclass.bankdata ;
  TABLES gender ;
PROC MEANS DATA = sasclass.bankdata ;
  VAR educ preexp bdate jobtime salary ;
RUN ;
Step 3: Then choose RUN from the main menu and select the SUBMIT option.
If you look at the OUTPUT file, you'll see that the frequency tables now have the variable label as a heading rather than the variable name; and the means tables have a new column for the labels. For example, here is part of the means table:
Data Analysis
We have an active data set with labels, and we've taken care of missing values. Now our data is ready for further analysis. We've already looked at descriptive statistics for salary, but we're interested in explaining that distribution: What accounts for differences in salary among the bank employees? We'll start by looking at differences between men and women, and then use other variables and procedures to further explore the relationship.
Examining Differences
We've already gotten a general picture of the SALARY variable for the entire sample. What we're interested in now is the difference in salaries between men and women. There are actually several ways to compare male and female salaries with SAS. Let me show you a few possible procedures, and then have you do only one of them.
Separate lists: In the following command segment, the first PROC step sorts the data by gender, a necessary prerequisite for the second PROC, which lists all of the data for female cases and then all of the data for male cases.
PROC SORT DATA= sasclass.bankdata ;
  BY gender ;
PROC PRINT;
  BY gender ;
RUN ;
There is no output from the first PROC, and the lists are too long to reproduce -- and probably too long to be of much use. Note that we did not specify the data= option in the print procedure. In the cases where we do not specify the data set explicitly, SAS uses the default data set, which is the most recently created or modified data set.
Cross-tabulation: The next PROC is a bit more useful, by allowing quicker comparisons between men and women at each level of salary.
PROC FREQ ;
  TABLES salary * gender ;
RUN ;
However, the table is no smaller (there's actually more information) and no easier to summarize, because there are so many levels of salary.
Summary comparisons: Both of those procedures might be useful (particularly with very small data sets), but we need summary comparisons to make our case. In addition to showing the mean salary for the entire sample, SAS can also provide means for specific groups, allowing us to see how the mean salary differs between men and women.
Step 1: Type the following lines at the bottom of your command file:
PROC SORT DATA= sasclass.bankdata ;
  BY gender ;
PROC MEANS DATA = sasclass.bankdata ;
  BY gender ;
  VAR salary ;
RUN ;
Step 2: To submit these commands, highlight this section and, this time, try clicking on the "running man" icon on the toolbar:
Remember to check the LOG file first, but the output from this procedure should include this:
As you can see, there is indeed a difference in salaries between men and women within the bank: On average, men make almost $15,000 more than women (with means of $41K vs. $26K), although men's salaries are almost three times as dispersed as women's (with standard deviations of $19K vs. $7K).
Possible explanations for this difference might include differences in education, previous experience, age, length of tenure with the bank, and sexual discrimination. Fortunately, our data set includes information about the first four of those. We'll start by looking at differences in education: Perhaps men are paid more than women because they are better educated (a difference which itself would have to be explained at some point). First, we will need to recode the education variable.
Recoding Variables
Our data on education is currently interval, indicating an actual number of years of education completed by the respondent, from 0 to 16. For now, our comparison will be easier if we recode this variable (i.e. regroup the values, and thus the cases) so that only three categories are present:
- 16 or more years of education (presumably those with college degrees)
- 12 to 15 years of education (presumably those with only a high school diploma), and
- less than 12 years of education (presumably those without a high school diploma).
Recoding conventions: There are three major conventions for recoding, and you should take care to choose one appropriate to your data and research question. The convention we're employing is logical divisions, using thresholds associated with expected differences (12 years for high school and 16 for college). A second convention is to use equal divisions, such as age groupings by decade (teens, 20s, 30s, 40s, etc.) A third convention, and the most robust, considers the shape of the distribution being recoded to find empirical concentrations, independent of expectations and equality.
Recode from a safe copy: Any variable can be recoded simply and quickly, and the data set is altered without any option to leave it unsaved. Consequently, there's a risk that you may recode something and then not be able to ascertain the original differences. (For example, right now we want an ordinal measure of education, with three levels, but later we may want to consider the 16 levels separately.) To eliminate this risk, you should recode a "safe copy" of the original variable -- an exact copy of the variable, with the same values for each case -- rather than the original variable itself. Creating a "safe copy" is easy. A stand-alone command could do it:
educ2 = educ ;
However, that adds a pass through the data without actually doing any recoding. That's not much of an issue with our dataset of only 474 cases, but with a large data set (e.g. 10,000 cases) it would add a lot of time. Instead, you can also create the safe copy at the same time that you recode.
Re-assign values: Recoding is most easily done using IF/THEN statements, as in the following:
DATA sasclass.bankdata ;
  SET sasclass.bankdata ;
  IF educ GT 0 AND educ LT 12 THEN educ2 = 1 ;
  ELSE IF educ GE 12 AND educ LE 15 THEN educ2 = 2 ;
  ELSE IF educ GE 16 THEN educ2 = 3 ;
In each IF line, the new variable EDUC2 (a safe copy of EDUC) is created and set based on the value of the original variable EDUC: The first line identifies those cases with greater than (GT) 0 years of education but less than (LT) 12, and puts them in the first group, those who did not complete high school. The second line identifies those with greater than or equal to (GE) 12 and less than or equal to (LE) 15 years and puts them in the second category, those who completed high school and may have attended some college. The third line identifies those with 16 or more (GE 16) years of education as belonging in the third group, whom we presume have finished college.
Validating recodes: The first thing you should do after recoding any variable (including creating a new variable and regrouping the categories) is to look at a frequency distribution of it. Add these lines:
PROC FREQ ;
  TABLES educ ;
  TABLES educ2 ;
RUN ;
then submit these nine lines together, and you should see the following output:
The 53 cases with 8 years of education are in the first category; the 312 cases (190 + 6 + 116) with 12 to 15 years of education are grouped in the second; and the 109 cases (59 + 11 + 9 + 27 + 2 + 1) with 16 or more years of education are in the third.
Note that when you re-assign values, you should also re-assign value labels, so that someone coming later (including yourself) will know what categories 1, 2, and 3 of EDUC2 mean. Again, we'll talk about value labels in the second session -- for now, just remember what 1, 2, and 3 stand for.
Utilizing new variable: Now that we've recoded education, we can consider whether educational differences explain differences in salary. One way would be to look at (disaggregated) differences in means, as we did above disaggregating salary by gender. The following will do the same by educ2:
PROC SORT DATA= sasclass.bankdata ;
  BY educ2 ;
PROC MEANS ;
  VAR salary ;
  BY educ2 ;
RUN ;
And the output from this procedure should include the following:
This output suggests that salaries do vary by education level: On average, those with less than 12 years of education earn $24K, those with 12-15 years earn $28K, and those with 16 or more years earn over $57K. College graduates earn on average twice what those without a high school diploma earn, and their mean is more than the maximum earned by employees without a high school diploma.
Perhaps, then, education explains salary differences. But does it explain salary differences between men and women? First, we would need to know whether there are educational differences between men and women. To do that, we'll turn to cross-tabulations (a.k.a. cross-classification tables). But before that, we should save the data changes we've just made.
Saving Your Data
When SAS reads data with a DATA statement, it creates a binary file in a format which only SAS can understand, called a SAS dataset. All SAS statistical procedures operate on what is called the current (or active) data set. A file created by a DATA statement automatically becomes the current dataset, and any new variables added by data transformations or IF statements are added to this SAS dataset as they are created. Other procedures, such as PROC STANDARD, have options that add new variables, such as Z-scores or predicted values, to the active dataset for analysis by other procedures.
SAS data sets may be either temporary or permanent. The dataset normally disappears at the end of a SAS job. However, you may save SAS the work (and yourself the time) of recreating the dataset next time (and on each successive run on that data) by creating a SAS permanent dataset for later use.
SAS determines whether a dataset is to be temporary or permanent from the name you give it. All SAS filenames are really two-level names, in the form libref.membername. If you want the data set to be temporary (not saved to your directory), then use only the membername: When you use only a membername, SAS provides the default libref WORK and deletes the data set when it is done running. If you need to save the data set, then use both a libref and membername: If a permanent data set is created, it is stored in a SAS Library and may be used in other SAS programs without re-creating it. (Note that on UVA systems, a SAS Library is a logical concept, not a physical entity.)
Notice in the command file above that the name of the SAS dataset, sasclass.bankdata, has two parts. With such a two-part name, the dataset is permanently stored for future use. We did that so we can use the data in the second workshop session. But it is always good to save to a new dataset, just in case.
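A minimal sketch of the two naming styles, assuming the libref sasclass is already assigned as earlier in this workshop. The member names scratch and bankcopy are made up for illustration:

```sas
DATA scratch ;             /* one-level name: stored as WORK.scratch, deleted when SAS exits */
SET sasclass.bankdata ;
RUN ;

DATA sasclass.bankcopy ;   /* two-level name: saved permanently in the sasclass library */
SET sasclass.bankdata ;
RUN ;
```

Writing to a new member name such as bankcopy leaves the original sasclass.bankdata untouched, which is the "just in case" copy suggested above.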
Generating Cross tabulations
Now that we have recoded education, we can assess educational differences between genders. Again, making this comparison will help us assess the plausibility that differences in educational levels actually explain the difference in salaries between men and women.
Since both the new education variable and the gender variable are categorical variables (ordinal and nominal, respectively), the appropriate procedure to assess difference in educational level across the two genders is to generate a "cross-classification" table, or "cross-tab". We do this with the following commands, similar to what you saw previously:
PROC FREQ ;
TABLES educ2 * gender ;
You can type those at the bottom of your command file, followed by a RUN statement, and then highlight and submit them. You should see this output:
While 32.16% of male employees have at least a college degree, only 11.63% of female employees do -- the male bank employees are almost three times as likely to have a college degree as the females. And we've already seen that salaries vary across values of education. Thus, it is at least plausible that education accounts for salary differences between male and female bank employees.
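The percentages quoted here are column percentages (percent within each gender). As a hedged sketch, the same table can be trimmed to show only those, with a chi-square test of the association added. NOROW, NOPERCENT, and CHISQ are standard TABLES options; like the command above, this runs against the current dataset:

```sas
PROC FREQ ;
TABLES educ2 * gender / NOROW NOPERCENT CHISQ ;
/* NOROW and NOPERCENT suppress the row and overall percentages, */
/* leaving counts and column percentages; CHISQ adds the test    */
RUN ;
```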
In order to fully assess whether education accounts for the salary differences, we will utilize a procedure called Linear Regression. This procedure also allows us to examine the influence of other factors that we hypothesized might account for salary differences -- age, previous job experience, and length of tenure at the bank. We'll pick up with that at the beginning of the next session.
Documentation and Help
SAS Online Documentation offers easy access to the most frequently used SAS documentation (previously available only in print), including news about SAS components that are shipped as experimental or beta. SAS for Windows includes pull-down help as well as ASSIST menus and dialogue boxes. You can also use the HELP command from the SAS for Windows command line.
SAS Tutorial
In SAS version 8.2 the online tutorial may be accessed from the Help drop-down menu by selecting the SAS Online Tutor under Books and Training. SAS Institute provides an on-line computer-based training (CBT) tutorial. The SAS/TUTOR module is licensed and available for SAS for Windows and SAS on the RS/6000s. In order to use this program you need to obtain the SAS/TUTOR training notes, which are available for purchase at the University Bookstore's PROFS Publishing. The cost is based on the cost of PROFS Publishing photocopying the original notes. (If you have questions about getting a copy of these notes, please e-mail res-consult@virginia.edu.) Once you have these notes, you can invoke the SAS/TUTOR module.
In SAS for Windows or OS/2, version 8, SAS/TUTOR is invoked by starting up SAS, then selecting online training from the Help menu.
In SAS on Macintosh, double-click on the SAS/Tutor icon in the SAS folder.
On the RS/6000s, the SAS/TUTOR for SAS, version 6.09, is started by typing /sas/sastutor at the Unix prompt. You can choose an item by "tabbing" to it and pressing enter to select it. Please note that the SAS/TUTOR for the RS/6000s is best used in the X-Windows interface. See the ITC document U-025 for details on using the X-Windows interface to SAS on the RS/6000s.
You may also want to look at the use of SAS/ASSIST for creating a command file by using the pull-down menus and selecting the commands needed in their appropriate order.
SAS Manuals
SAS Institute, Inc. publishes a large library of manuals and statistical procedure guides. Some of these are available in the trade books section of University of Virginia bookstore. All of the manuals listed may be purchased directly from the SAS Institute, Inc., or may be ordered through any bookstore. They are also available, for reference use only, at the ITC Research Computing Support Center, Wilson Hall Room 244. There are manuals in the Research Center that can be checked out for up to 24 hours. Speak with the computing consultant in Room 244 Wilson in order to check out a manual.
Sample Syntax
Another aid to understanding SAS may be obtained by looking at sample programs provided by SAS. The programs come complete with data, and may be examined for ideas on how to set up a procedure, or may be run so that the output of the program may be studied.
The location of the SAS Institute, Inc. example files for PC SAS and SAS for Windows depends on the choices made during installation of these products on your computer. In general, they are in the SAS subdirectory along with the module to which they pertain. For example, SAS Institute, Inc. sample files for the STAT module are generally located in /SAS/STAT/SAMPLE, whereas the sample files for the ETS module would be in /SAS/ETS/SAMPLE.
On the RS/6000s, sample files from SAS Institute, Inc. are in the directory /sas8/samples, in subdirectories labeled base, stat, graph, af, ets, insight, and or. These files must be copied to your own account before you can run them. Locally written example files may be browsed or copied from the /help/unix/statistics/sas/examples directory.
Web Documentation
The Statistical Computing Support web site includes answers to frequently asked questions, as well as information about licensing SAS and renewing your license, other products that might be useful, and links to dozens of other sites that may be of use to you: http://www.itc.virginia.edu/research/statistical.html
The Research Computing Group supports technologically advanced statistical work. http://www.itc.virginia.edu/researchers/
SAS provides assistance via its own Technical Support website. In addition, in version 8, the Help files included with the program are in HTML (Web) format.
The SAS Technical Site: http://www.sas.com/ts
SAS 8 online documents (help): http://www.itc.virginia.edu/manuals/sas8/onldoc.htm
Consulting Services
Additional assistance with SAS command file construction and statistical routines is available from the Statistical Computing Consultant located in the ITC Research Computing Support Center in Wilson Hall Room 244 (243-8800). The consultants can be contacted via electronic mail to res-consult@virginia.edu. Please note that consulting hours vary by semester, as well as holidays.
For statistical consulting (as opposed to statistical computing consulting), you may wish to contact the Statistics Division of the Math Department (http://www.stat.virginia.edu/uvastat.html). There are no charges for the advice of the faculty consultant, but there is a fee for graduate student consultants ($45 per hour) as well as for the expertise of statistics faculty other than the dedicated consultant ($95 per hour). To find the current faculty consultant, contact the division's secretary, Ms. Kathi Marshall, Halsey Hall room 103, 924-3222.
Table of Contents
- Welcome to Part II
- Obtaining the Files Used in this Tutorial
- Advanced Customizations
- Annotating Syntax
- Computing a New Variable
- Value Labels
- Advanced Data Analysis
- Analysis of Variance
- Scatterplot
- Simple Linear Regression
- Choosing Variable Lists
- Advanced Data Entry
- Importing Excel data
- Elements of a Data Step
- Formats for ASCII
- Exporting to ASCII
- Moving Between OSes
- Advanced Data Management
- Merging Data Sets
- Using Arrays
- Advanced Appendices
- Options for Program Editor
- Methods for Running SAS
- Numerical Accuracy and Representation
- Documentation and Help
- SAS Tutorial
- SAS Manuals
- Sample Syntax
- Web Documentation
- Consulting Services
Welcome to Part II
When we left off last time, we had determined that men and women employed at the bank receive different average salaries, that salaries vary by educational level, and that there are educational differences between men and women in the sample. This suggests that education may account for the gender differences in salary, but we hypothesized that several other factors may also play a role in explaining this co-variation.
In order to fully assess whether education accounts for the salary differences, we will utilize a procedure called Linear Regression. This procedure also allows us to examine the simultaneous influence of other factors that we hypothesized might account for salary differences: age, previous job experience, and tenure at the bank.
We'll also read in raw data and then process commands, whereas before we used a SAS permanent dataset which had already been created for you. And we'll look at several options for customizing output, as well as options for importing and managing your data.
But first we'll look into some ways to customize your SAS data sets. We'll start by looking at an existing command file called course1.sas which includes most of the commands you entered in the first session, plus several new ones we'll cover in this session.
Obtaining the Files Used in this Tutorial
There are several files that have been created for use in this tutorial. The tutorial assumes that the files are saved on the hard drive of your PC in an area named C:/Temp (of course you may choose to save these files elsewhere). If you have problems downloading or using the files using Netscape, using Internet Explorer may resolve the problem.
To save the following files to the C:/Temp area of your hard drive:
Step 1. RIGHT click on the link corresponding to the filename below you want to get.
Step 2. Choose the Save As... option from the pull down menu. When the Save As... dialog box opens, use the navigation tools to designate the C:/Temp area as the save in area.
Step 3. Click the save button to save the file in the C:/Temp area.
Using the above method, save the following files in your C:/Temp directory.
Files needed for Part 2: (assumes you have previously downloaded Part 1 files)
- bankend1.sas7bdat - The SAS dataset (for Part 2)
- readascii.sas - A file containing the SAS commands to read bank.dat
- course2.sas - A file containing the SAS commands for Part 2 of the tutorial
- exportascii.sas - A file containing the SAS commands to write an ASCII text file
- bank.xls - An Excel file containing the bank data
- course1.sas - A file containing the SAS commands from Part 1
- sasdata.dat - A raw data file
Advanced Customization
Annotating Syntax
You've probably already realized that SAS commands are very specific, and that small changes can make large differences -- or prevent you from getting any output at all. So that you'll later understand what you did, and so that you can more easily replicate procedures with other data sets and for other projects, it is advisable to comment your SAS command file -- that is, to add annotations which detail what the written syntax does (or, at least, what you had intended it to do).
Comments are useful for documenting what you are doing and are highly recommended. Comments may be put anywhere in a SAS file as long as they are bracketed by a slash and asterisk at the beginning and an asterisk and slash (reverse order) at the end.
/* one way of adding comments */
/* this style also works when you have many lines to type,
or multiple SAS statements you want to comment out.
In that case, you only need a slash asterisk at the start
and an asterisk slash at the end; you don't need them on each line */
Another comment style is to put an asterisk at the beginning of a line, which is useful in commenting out a command. Note that comments that begin with an asterisk are ended with a semicolon.
* another way of adding comments ;
* With this style your comments can go across lines,
but not across SAS statements: because the semicolon ends
this style of comment, you can only comment out one SAS statement ;
Computing a New Variable
While there are already variables in the data set for job experience (PREEXP) and tenure (JOBTIME), there is not yet a variable for age. But there is a measure of date of birth, which we can use to create a measure of age. The "command" is similar to what we did to create a "safe copy" of EDUC, but extends the logic: Rather than creating a mere copy, it calculates an age for each respondent based on a simple mathematical expression. This can often be done in one SAS step but here we do it in two. First we set y equal to the value returned by the YEAR function, which strips the year of birth from the bdate variable. Next we determine the value of the variable age with the expression age = 1999 - y.
Note these steps will also create a new SAS dataset from the data set bankend1. Bankend1 contains the modifications that we did in the previous session. In order to use the library name "sasdir" or any libname on a DATA step, you have to let SAS know about it. You can do this using the mouse as we did in Part 1, or you can use the command shown below.
LIBNAME sasdir "C:/temp" ;
DATA sasdir.bankdata ;
SET sasdir.bankend1 ;
y = year (bdate) ;
age = 1999 - y ;
PROC FREQ ;
TABLES age ;
RUN ;
Above, line 2 creates a new SAS permanent dataset, and line 3 reads in an existing SAS permanent dataset. Lines 4 and 5 create the new variable AGE, the next two lines request a table for validation, and RUN makes the PROC FREQ execute immediately, rather than "wait" until a second SAS procedure is submitted.
Step 1. Type in the lines of SAS commands above in the Enhanced Program Editor
Step 2. Submit these commands to SAS to execute by using the F8 function key. This does the same thing as pressing the SUBMIT button (the running dude) on the toolbar.
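For comparison, a sketch of the one-step version mentioned earlier, which nests the YEAR function inside the expression and skips the helper variable y. It assumes the libref sasdir has been assigned with the LIBNAME statement shown above:

```sas
DATA sasdir.bankdata ;
SET sasdir.bankend1 ;
age = 1999 - YEAR (bdate) ;   /* year of birth is extracted and subtracted in one expression */
RUN ;
```

The resulting age values are the same as in the two-step version; the only difference is that y is not stored in the dataset.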
After any adjustment of data, it is always a good idea to look at a frequency distribution to ensure that changes occurred in the manner expected. This output validates that what you tried to do (create a new variable indicating age) worked:
The same logic could be extended further, using more complex expressions.
For example:
Combining measures of satisfaction:
sat_all = sat_1 + sat_2 ;
Computing a measure of socio-economic status:
ses = (educ * income) / age ;
Computing overall income:
income = SUM (salary, overtime, bonus, stocks) ;
Note: SUM is a SAS function that adds all the listed variables together. If the value of a particular variable is missing for a given case, SUM omits that variable for that case but still gives a total of the remaining values. When calculating new variables from existing ones, be aware of the effect missing values will have on your new data values.
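The difference between the SUM function and the plus operator shows up as soon as a value is missing. A small self-contained sketch, using a made-up dataset (sumdemo) and made-up values:

```sas
DATA sumdemo ;
INPUT salary overtime bonus ;
total_plus = salary + overtime + bonus ;       /* missing if any of the three is missing */
total_sum  = SUM (salary, overtime, bonus) ;   /* adds only the non-missing values */
CARDS ;
30000 2000 500
28000 . 750
;
RUN ;

PROC PRINT DATA = sumdemo ;
RUN ;
```

For the second case, where overtime is missing, total_plus is missing while total_sum is 28750. Which behavior is appropriate depends on whether a missing component should be treated as zero or should invalidate the total.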
Value Labels
We previously created a variable educ2 from educ, but haven't yet made clear (in our dataset or to other users) what the values of educ2 mean. Two statements do this: the first (the VALUE statement in PROC FORMAT) creates a new format that we name degree (and would typically come prior to the DATA statement); the second (the FORMAT statement) ascribes the newly created degree format to the variable educ2 (and would typically come within the DATA step, following the INPUT statement). In our case, we can enter them as follows:
PROC FORMAT ;
VALUE degree 1="no HS" 2="no BA" 3="BA or more" ;
PROC FREQ ;
TABLES educ2 ;
FORMAT educ2 degree. ;
RUN ;
Note that you MUST end your format name with a period when you use it, as in "degree." on the FORMAT statement above.
Add, highlight, and SUBMIT these commands to get the following output:
Naming conventions: Many special characters -- such as # @ $ {} -- cannot be used in variable names, value names, or data set names. As of Version 8.1, variable names may be up to 32 characters long, beginning with a letter (A-Z and a-z) or an underscore (_). Subsequent characters may be numbers (0-9), letters, or underscores. (However, use caution if starting a name with an underscore, because SAS sometimes creates internal variables that start and end with an underscore.)
Note that SAS value labels (the words associated with categorical numbers of a variable) are not stored with the SAS dataset, but in a separate SAS binary file called a catalog.
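Because the format lives in a catalog rather than in the dataset, a format created with a plain PROC FORMAT is stored in the temporary WORK library. A hedged sketch of one way to keep it between sessions, assuming the C:/temp directory used elsewhere in this tutorial:

```sas
LIBNAME sasdir "C:/temp" ;

PROC FORMAT LIBRARY = sasdir ;   /* stores the format in the catalog sasdir.formats */
VALUE degree 1 = "no HS"
             2 = "no BA"
             3 = "BA or more" ;
RUN ;

OPTIONS FMTSEARCH = (sasdir) ;   /* tell SAS to search sasdir for user-written formats */
```

In a later session, the LIBNAME and OPTIONS statements need to be submitted again before the degree format can be used.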
Controlling Output: You can change a variety of system options from their default values, including the page width, page length and line spacing, centering output, and printing the current date on the output. For example:
OPTIONS NOCENTER FORMDLIM = "~" ;
tells SAS not to center the output, and FORMDLIM says that, rather than use the default formfeed character to delimit output sections, SAS should use a line of tildes. This option saves paper, but may make output difficult to read when printed, because tables may get broken across pages. There are myriad options available; consult the documentation for additional information.
You can also specify a title to be printed at the top of each page of output, using the TITLE command. You may specify a title anywhere in your SAS job, and SAS will use that title starting at that point in the job. You can even specify multiple title lines by appending a number to TITLE. You are limited to ten TITLE lines. An example of two title lines is:
TITLE 'My dissertation analysis, phase 1' ;
TITLE2 'Print out of those cases which appear miscoded' ;
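Putting page options and titles together, a short sketch that assumes the sasdir.bankdata dataset created above. The option values and title text are arbitrary choices for illustration:

```sas
OPTIONS LINESIZE = 80 PAGESIZE = 60 NODATE NONUMBER ;   /* page width, page length, no date, no page numbers */
TITLE 'Bank salary analysis' ;
TITLE2 'Frequency distribution of age' ;
PROC FREQ DATA = sasdir.bankdata ;
TABLES age ;
RUN ;
TITLE ;   /* a TITLE statement with no text clears all title lines */
```

Titles stay in effect until they are replaced or cleared, so the final TITLE statement keeps these lines from appearing on later output.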
Advanced Data Analysis
The first two procedures we'll do don't make use of the recoding and labeling we've just done, but help focus on the relationship(s) that we're interested in explaining.
Analysis of Variance
We've already compared the salaries for each gender, but only with point estimates (of mean and standard deviation). We could test our confidence in a relationship more fully by statistically comparing the two groups. The following command performs a t-test of the difference in mean salary between the two genders (with only two groups, this is equivalent to a one-way analysis of variance, or ANOVA).
PROC TTEST ;
CLASS gender ;
VAR salary ;
TITLE 'Descriptive statistics listed separately for each gender' ;
RUN ;
Type the commands above, highlight them, and SUBMIT them. Your output (in the OUTPUT window) should include the following:
How to create an HTML version of this output
To get an HTML version of the output, choose Tools -> Options -> Preferences -> Results, check Create HTML, then submit the program. Or you can run the ODS HTML command before the SAS commands for which you want HTML-style output, and then close the HTML output file after those commands. For example:
ODS HTML FILE = "c:/temp/ttest.html" ;
PROC TTEST ;
CLASS gender ;
VAR salary ;
TITLE2 'Descriptive statistics listed separately for each gender' ;
RUN ;
ODS HTML CLOSE ;
This output shows some numbers we have seen before -- such as that the mean salary for women is about $26K while the mean salary for men is about $41K, but the distribution of salaries for men is wider (with a standard deviation of about $19.5K rather than $7.5K).
The p-value (<.0001) suggests there is a low risk of being wrong in rejecting the null hypothesis of equal mean salaries, which confirms our impressions from our look at the data in Part 1, and supports our claim that there is a statistical difference between male and female salaries. But why is there a difference? What about those other factors?
Scatterplot
We've already considered the role of education by grouping cases in cross tabulations. We could also produce a plot to compare individual cases. Let's compare two such plots -- one looking at the role of education (EDUC, not EDUC2), and a second looking at the role of tenure at the bank (JOBTIME) -- in order to compare the role of several factors, controlling for gender.
Type the following commands, and highlight and submit them:
PROC PLOT ;
PLOT salary * educ = gender ;
TITLE2 'Plot of Salary by Education' ;
PLOT salary * jobtime = gender ;
TITLE2 'Plot of Salary by JobTime' ;
RUN ;
The output should include the following two plots:
The difference is apparent: In the first plot (looking at the relationship between education and salary), there is an apparent relationship across the entire sample, although we can see that few women are above $40K or 16 years of educ. The second plot shows a less clear relationship between job tenure and salary -- and the few women that earn above $40K are not the ones with the longest job tenure. So, education and gender seem to be related to salary, but length of job tenure does not.
Let's try one more factor before proceeding. Add these lines, and highlight and SUBMIT them:
PROC PLOT ;
PLOT salary * age = gender ;
RUN ;
This third plot shows something very different: Older employees tend to earn less, with the highest salaries going to men in their 40s and the lowest going to women older than 50:
We could investigate a cross tab and plot of each pair of variables. But since we've already shown the possible relevance of at least three explanatory variables, let's turn to a method which allows us to consider all of these variables simultaneously.
Simple Linear Regression
To generate a Linear Regression that addresses our research question, the SAS commands are:
PROC SORT DATA= sasdir.bankdata ;
BY gender ;
PROC REG ;
MODEL salary = educ jobtime preexp age ;
BY gender ;
RUN ;
The output should include the following:
Several numbers in this output are important; you can largely focus on those and ignore the others.
- The "Parameter Estimate" column gives the unstandardized coefficients for each variable and the intercept, and suggests how the model may be written as an equation. For instance, for the first model (for females), the predicted regression model is:
salary = 8999.35 + 1573.56*educ + 27.99*jobtime + 1.96*preexp - 111.14*age
This suggests that, once we have controlled for all of these variables simultaneously, each additional year of education increases women's salaries by about $1573.56, each additional month of job time increases their salaries by $27.99, and each level of previous experience adds $1.96 -- but each additional year of age reduces their salary by an average of $111.14. Men, by contrast, get almost three times the benefit for each year of education ($4250.41 vs. $1573.56), more than four times the benefit for job time ($119.62 vs. $27.99), and benefit rather than detriment for age (+$757.82 vs. -$111.14), but detriment rather than benefit for previous experience (-$51.63 vs. +$1.96).
- The "Prob > |T|" column gives the p-value for the t-test of the null hypothesis that each parameter is zero -- i.e., this column tells you whether the above coefficients are statistically significant. Using a critical alpha of 0.05 in the above example, the effects of education and age are significant for both men and women, but the effects of jobtime and previous experience are not significant for either men or women.
- The "R-square" statistic summarizes the fit of the model, in several ways. Here, the value of 0.3179 for the model for women suggests that 31.79% of the variance in salary is explained by these four explanatory variables, and that we reduce our errors in predicting salaries of women by 31.79% by knowing values of these other variables. It also tells us that the strength of the model is moderate. (A typical convention here is that an R-square under .10 is weak, one between .10 and .5 is moderate, and one above .5 is strong.)
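To see how the parameter estimates turn into a prediction, the sketch below plugs one hypothetical employee into the equation for women given above. The values for educ, jobtime, preexp, and age are invented for illustration and do not come from the bank data:

```sas
DATA _NULL_ ;
educ = 12 ; jobtime = 80 ; preexp = 50 ; age = 30 ;   /* hypothetical employee */
salary_hat = 8999.35 + 1573.56*educ + 27.99*jobtime + 1.96*preexp - 111.14*age ;
PUT salary_hat= ;   /* writes the predicted salary to the LOG window */
RUN ;
```

The terms are 8999.35 + 18882.72 + 2239.20 + 98.00 - 3334.20, so the predicted salary is about $26,885.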
These regression results indicate that education does have a significant effect on salary. The coefficient for education is significant for both men and women. However, even after controlling for the effects of educational differences between sexes, gender continues to have a significant effect: Each additional year of education is worth an additional $1573 for women and $4250 for men, on average. While some other coefficients are not statistically significant, the coefficients for age also indicate gender disparity: Each additional year of age increases men's salaries by about $758 but lowers women's salaries by $111. This suggests the possibility of either some other unmodeled difference between the sexes or sexual discrimination on the part of the bank.
Having completed our analysis of the bankdata data set, let's look at other topics, including choosing variables and importing data into a SAS dataset from ASCII and Excel files.
Choosing Variable Lists
Variables are defined in the order they appear in the INPUT statement. After a complete list has been defined in a SAS program, you can use abbreviated variable lists in later SAS statements. If your variable names end in consecutive numbers (e.g., test1, test2, test3, test4) then you may refer to them as a group using a single dash. To refer to the four variables test1 to test4, type: test1-test4. You may even do so in the INPUT statement, provided all of the variables in the abbreviated list have the same format. If your variable names do not end in a number (e.g., educ, jobcat, salary, salbegin), then you may refer to them in abbreviated form using two dashes, which selects variables by their order in the dataset. For example, educ--salbegin refers to all four of the variables listed here. And these shortcuts can be employed in commands, as in the following (which you need not type and submit):
PROC PRINT ;
VAR id salary--minority ;
All of the commands up to this point are included in course2.sas, and annotated -- if you want to check what you've done, or save an annotated copy for future reference.
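Both kinds of abbreviated variable list can be tried on a small throwaway dataset. This sketch is not part of course2.sas; the dataset name scores and its values are made up:

```sas
DATA scores ;
INPUT id test1-test4 ;    /* numbered range list in the INPUT statement */
CARDS ;
1 80 75 90 85
2 60 70 65 72
;
RUN ;

PROC MEANS DATA = scores ;
VAR test1-test4 ;         /* same as VAR test1 test2 test3 test4 */
RUN ;

PROC PRINT DATA = scores ;
VAR id--test2 ;           /* positional list: id, test1, test2 */
RUN ;
```

The double-dash list depends on the order in which the variables were defined, so it may select different variables if the INPUT statement is rearranged.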
Advanced Data Entry
Importing Excel data
Follow these steps to import the Excel data file bank.xls into a SAS dataset named fromexcl.
Step 1: From the drop down menus, choose FILE, and IMPORT DATA as shown here:
Step 2: Select the Standard file format and then select the Excel type (.xls) from the pull-down menu. Then select Next:
Step 3: Now find the Excel file you wish to import using BROWSE or by typing in the full path. Then click Next:
Step 4: Choose the Library and Member name you wish for the Excel file. Then click Finish:
Step 5: The VIEWTABLE window should open as follows:
If you select the wrong format for the Excel file, you'll probably get an error message that's similar to the one below:
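The wizard steps above can also be written as a command. The following is a sketch only: it assumes SAS/ACCESS Interface to PC File Formats is licensed, and the DBMS= value (EXCEL2000 here) is an assumption that depends on your SAS release and the version of the Excel file:

```sas
PROC IMPORT DATAFILE = "c:/temp/bank.xls"
            OUT = work.fromexcl
            DBMS = EXCEL2000
            REPLACE ;
GETNAMES = YES ;   /* use the first spreadsheet row as variable names */
RUN ;
```

REPLACE overwrites an existing dataset of the same name; leave it off if you would rather have the step stop than overwrite fromexcl.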
Elements of a Data Step
The commands you've issued in Part 1 and most of this Part have used a data set which is already in SAS format -- a SAS dataset, ending in the extension sas7bdat. Frequently you will instead need to include statements to get a raw data set (e.g., bank.dat) and identify its columns and rows as variables and cases, respectively. These statements can also create a permanent data set, such as the one we've been working with, so that you could later open the file without having to reference and detail the original bank.dat file.
Although data may be included directly in a SAS command file, it is more often read into SAS from a separate data file. In some lucky cases (such as our first workshop session), the data are already in SAS format. More typically, you will have data in some other format, such as ASCII (raw DOS or Unix text) or an Excel spreadsheet. You might also use an editor (such as vi on Unix or the Notepad on Windows) to create such a file.
You use a DATA statement to tell SAS that you will be manipulating data. You use an INPUT statement to tell SAS what names to give each of the variables, as well as the column locations where each value can be found. The following syntax creates the dataset we've been using, from a raw ASCII file called bank.dat:
These commands are available in the file readascii.sas
DATA sasdir.banknew ;
INFILE 'c:/temp/bank.dat' MISSOVER ;
MISSING X ;
INPUT @1 ID 4.
      @5 GENDER $char1.
      @6 BDATE MMDDYY8.
      @14 EDUC 2.
      @16 JOBCAT 1.
      @17 SALARY Dollar7.0
      @24 SALBEGIN Dollar7.0
      @31 JOBTIME 2.
      @33 PREEXP 6.
      @39 MINORITY 1. ;
RUN ;
The DATA step is where data management activities occur. DATA steps are used to:
- Input data, either from raw data files or previously saved SAS data sets. Variable names and attributes are assigned and read in from user-specified locations.
- Transform data, through calculation, selection, or merging of data from several sources.
- Output data sets or print files. The data sets may be permanent or temporary. Print files may be created to produce reports or tables.
IMPORTANT NOTE: The first part is the DATA statement and a dataset name. This name is not the name of your raw data file. It is a name you are giving to the SAS dataset that SAS will create from your raw data. By including a libref (here, sasdir) in the name, you are making this SAS dataset a permanent one. If, instead, you had typed DATA bankdata ; then the "bankdata" dataset would be deleted when you exit SAS.
The second part is the INPUT statement where you name your data values and describe their location. The INPUT statement describes the arrangement of the data values for each case or observation in your data file and assigns variable names and formats to those data values. The INPUT statement is used in conjunction with the CARDS command when the data reside within your SAS program file, and with the INFILE command when data are read from an external (non-SAS) file, on disk or tape. (The INFILE statement must occur before the INPUT statement, because it tells SAS what data file to read.) In either case, you may use any or all of the three input styles shown in the next section.
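As a small illustration of the CARDS case, the sketch below keeps the data lines inside the program itself. The dataset name people and the two data lines are made up; the INPUT statement uses the column style described in the next section, so the spacing of the data lines matters:

```sas
DATA people ;
INPUT name $ 1-8 gender $ 9 age 13-14 ;
CARDS ;
Alice   F   34
Bob     M   41
;
RUN ;

PROC PRINT DATA = people ;
RUN ;
```

No INFILE statement is needed here, because SAS reads the lines between CARDS and the lone semicolon as the data.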
Formats for ASCII Input
Telling SAS how to read a data file is fairly simple, and SAS is flexible about how data can be described. SAS provides a vast array of INPUT options and specifications, only a few of which will be covered in this document.
Note that these SAS commands assume you have defined the SAS library reference mylib and that you are creating a SAS dataset called mydata. Here the SAS commands read character input from the file sasdata.dat.
- Column style input, which specifies the columns for each variable using this syntax model:
DATA mylib.mydata ;
INFILE 'c:/Temp/sasdata.dat' ;
INPUT name $ 1-8 gender $ 9 age 13-14 ;
The $ symbol is used to designate an alphabetic variable, also known as a character variable. The name mydata is an eight character or less name you make up for the SAS dataset that is being created from your raw data. In this case, NAME is an eight-character alphabetic variable that occupies columns one through eight.
You may also wish to use the MISSOVER option, which prevents SAS from going to a new input line if it does not find values in the current line for all the given variables; remaining variables are set to missing.
DATA mylib.mydata ;
INFILE 'c:/Temp/sasdata.dat' MISSOVER ;
INPUT name $ 1-8 gender $ 9 age 13-14 ;
- Fixed format input, which uses a column pointer (@column number) to point to the column where the variable starts, and a format modifier to indicate the width of the data values and/or the number of decimal places.
DATA mylib.mydata ;
  INFILE 'c:/Temp/sasdata.dat' ;
  INPUT @1 name $char8. @9 gender $char1. @13 age 2. ;
This tells SAS that NAME is a character variable that begins at column 1 and is eight columns long, while age is a numeric variable that is two columns wide and begins in column 13.
- List format input. You do not need to specify the location of the variables at all. List format input statements are the easiest to write, but list format is also the most prone to data errors: there must be at least one blank space between each pair of values, and missing values must be represented by periods, NOT blanks.
DATA mylib.newfile ;
  INFILE 'c:/Temp/sasdata.dat' ;
  INPUT name $ gender $ age ;
NOTE: This statement will produce errors with our data, because it does not account for missing data or for the fact that there are more than three variables on each data line in our data file.
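To see list input and the MISSOVER option working together, the following sketch reads a few made-up data lines placed in the program itself. The dataset name listdemo and all values are hypothetical. One line marks a missing age with a period, and another line is simply too short.

```sas
/* Made-up in-stream data, for illustration only */
DATA listdemo ;
  INFILE CARDS MISSOVER ;       /* a short line yields missing values     */
                                /* instead of pulling in the next line    */
  INPUT name $ gender $ age ;   /* list input: values separated by blanks */
  CARDS ;
Smith M 34
Jones F .
Brown M
Lee F 27
;
RUN ;

PROC PRINT DATA = listdemo ;    /* age should be missing for Jones and Brown */
RUN ;
```

Without MISSOVER, SAS would go to the next line to look for Brown's age and try to read "Lee" as a number, so it is worth checking the log whenever list input is used.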
Exporting to ASCII
SAS can write an ASCII text file of data via the FILE and PUT statements shown below. The variables in the PUT statement come from the SAS dataset banknew and are written to the file c:/Temp/bank.out. These commands are available in the file exportascii.sas, which you can download.
DATA _NULL_ ;                   /* Write this as is: "_NULL_" is a SAS keyword   */
                                /* for an unnamed dataset                        */
  FILE 'c:/Temp/bank.out' NOPRINT NOTITLES ;  /* You supply the filename         */
  SET banknew ;                 /* Notice no SAS LIBNAME is used. This makes     */
                                /* banknew a temporary dataset that will be      */
                                /* deleted when you exit SAS                     */
  PUT gender salary jobcat ;    /* List variables to export in the PUT statement */
                                /* Can use any formats, as in an INPUT statement */
RUN ;
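If the receiving program prefers comma-separated values, one possible variation is to add the DLM= option to the FILE statement. This is only a sketch: the output file name bank.csv is made up, and it assumes the temporary dataset banknew exists in the current SAS session.

```sas
/* Hypothetical variation: write the same three variables separated by commas */
DATA _NULL_ ;
  SET banknew ;                      /* temporary dataset from the current session */
  FILE 'c:/Temp/bank.csv' DLM=',' ;  /* bank.csv is a made-up file name            */
  PUT gender salary jobcat ;         /* list output, one line per observation      */
RUN ;
```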
Moving Between OSes
The following command file can be used to move SAS data from one operating system to another. Explanations and comments for the procedure are included in comment lines at the bottom of the file.
/*********************************************************************/
/* FILE is called /help/unix/statistics/sas/examples/export.sas      */
/* By: Tim FJ Tolson, User Support, ITC, University of Va.           */
/* ***** See documentation/explanations below *******         (6/92) */
/*********************************************************************/
OPTIONS LINESIZE=80 ;
TITLE 'Export SAS dataset to transport DATA set' ;
LIBNAME saslib '.' ;
LIBNAME transp XPORT 'transp.sasport' ;
PROC CONTENTS DATA = saslib._all_ ;
/* Use one PROC MEANS DATA = sasdataset ; statement for EACH dataset */
PROC MEANS DATA = saslib.tests MAXDEC=4 ;
PROC MEANS DATA = saslib.sexmean MAXDEC=4 ;
PROC COPY IN = saslib OUT = transp ;
  SELECT tests ;
RUN ;
/*----------***** EXPLANATION/DETAILS *****----------------------------*/
/* Use SELECT if you have several SAS member names in the SAS library  */
/* and only want to move one.                                          */
/* The first LIBNAME statement reads all of the SAS datasets from the  */
/* default (current) directory. The second LIBNAME statement points    */
/* to the directory and FILE that the transport dataset(s) will be     */
/* written into.                                                       */
/* If there is more than one dataset in the default directory, then    */
/* all the datasets will be written into the ONE transport file. If    */
/* you want to write separate datasets, use the SELECT option on the   */
/* PROC COPY command.                                                  */
/* If there are datasets in other directories, use additional LIBNAME  */
/* statements to point to those directories and use additional LIBNAME */
/* statements to point to additional transport files and then issue    */
/* additional PROC COPY commands.                                      */
/*                                                                     */
/* *** VERIFYING and CHECKING your data for transfer:                  */
/* PROC CONTENTS lists all the information about the dataset.          */
/* It gives the number of cases and variables, the variable names,     */
/* type, any formats, informats, and labels.                           */
/* PROC MEANS gives the Mean, Standard Deviation, Minimum & Maximum    */
/* value for each variable in the dataset.                             */
/* Use these two pieces of output to compare the original dataset to   */
/* the dataset AFTER you import it on your new computer system to make */
/* sure the data transferred correctly. It's one in 10,000 that it     */
/* does not, but you don't want to be the one!                         */
/*                                                                     */
/* **** IMPORTANT TRANSFER INFORMATION ****                            */
/* After you have created the SAS transport dataset you are ready to   */
/* move it to a new computer system. The transfer MUST BE MADE in      */
/* BINARY MODE!                                                        */
/* Use the FTP command:                                                */
/*     BINARY                                                          */
/* When you IMPORT the dataset, use the PROC CONTENTS and PROC MEANS   */
/* commands to generate the same information generated by this file    */
/* and compare the results to make sure the data transferred correctly.*/
/***********************************************************************/
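The import side on the receiving system can be sketched as the mirror image of the file above. This is an illustration rather than part of the original export.sas: the library name newlib is made up, and the sketch assumes that transp.sasport has been transferred in binary mode into the current directory and that it contains the dataset tests.

```sas
/* Hypothetical import step on the new system */
LIBNAME transp XPORT 'transp.sasport' ;   /* the transferred transport file        */
LIBNAME newlib '.' ;                      /* made-up libref: where the datasets go */

PROC COPY IN = transp OUT = newlib ;      /* unpack every dataset in the file      */
RUN ;

/* Produce the same listings as on the original system, for comparison */
PROC CONTENTS DATA = newlib._ALL_ ;
RUN ;
PROC MEANS DATA = newlib.tests MAXDEC=4 ;
RUN ;
```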
Advanced Data Management
Merging Data Sets
SAS allows you to merge sorted data sets in two ways: by order and according to the value of a specified variable common to both data sets. This is done using a MERGE statement in a DATA step. The MERGE feature is especially useful for dealing with hierarchical data. Merging according to the value of a specified variable creates a data set where the data for the upper level is duplicated for each member on the lower level of the hierarchy. For example, if you had two data sets containing economic data, one for STATES and another for CITIES, and if both data sets contain a variable STATE, merging the data sets by STATE would create a new data set with one observation for each city containing the data for that city as well as the data for the state in which the city is located.
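The STATES and CITIES example can be sketched as follows. All dataset names, variable names, and values here are made up for illustration; the point is only the pattern of sorting both datasets and then using MERGE with a BY statement.

```sas
/* Made-up upper-level data: one observation per state */
DATA states ;
  INPUT state $ stincome ;
  CARDS ;
VA 100
MD 120
;
RUN ;

/* Made-up lower-level data: one observation per city */
DATA cities ;
  INPUT state $ city $ cityinc ;
  CARDS ;
VA Richmond 40
MD Bowie 25
VA Norfolk 35
;
RUN ;

/* Both datasets must be sorted by the common variable before merging */
PROC SORT DATA = states ; BY state ; RUN ;
PROC SORT DATA = cities ; BY state ; RUN ;

DATA citystat ;
  MERGE cities states ;   /* state values are repeated for each city */
  BY state ;
RUN ;

PROC PRINT DATA = citystat ;
RUN ;
```

The resulting dataset should have one observation per city, each carrying its own values plus the values for its state.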
The SET statement in a DATA step will allow you to copy one data set into another with modifications. Variables may be discarded, recoded, renamed or left the same. For details see the SET, DROP, KEEP and RENAME statements in the SAS Language Usage manual, and the Language Reference manual.
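A small sketch of copying with modifications is shown below. It assumes a dataset banknew containing the variables gender, salary, and jobcat; the new names bankshort, pay, and paythou are made up for illustration.

```sas
/* Hypothetical copy of banknew that keeps three variables and renames one */
DATA bankshort ;
  SET banknew (KEEP = gender salary jobcat     /* discard all other variables */
               RENAME = (salary = pay)) ;      /* salary is now called pay    */
  paythou = pay / 1000 ;                       /* new variable: pay in thousands */
RUN ;
```

KEEP= is applied before RENAME=, so the KEEP= list uses the original variable names.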
If you will be merging data sets, either to add cases or to add variables, you are encouraged to read our on-the-web merging tutorial:
http://www.itc.virginia.edu/research/indepth/merging.html
Using Arrays
SAS gives you the choice of combining data of the same type (character or numeric) into an array. This is advantageous when similar transformations must be performed on many variables. For example, the SAS command file below reads in grades and then recodes ones to zeros using an array:
DATA gradeval ;
  INFILE grades85 MISSOVER ;    /* grades85 is a fileref for the raw data file */
  /* DEFINE THE ARRAY */
  ARRAY question{10} q1-q10 ;
  /* question is the array name and 10 is the number of elements */
  /* q1-q10 are the array elements */
  INPUT @40 q1 BZ1. q2 BZ1. q3 BZ1. q4 BZ1. q5 BZ1. q6 BZ1. q7 BZ1. ;
  /* BZ1. is a SAS informat for a value 1 column wide */
  /* that transforms blanks to zeros */
  /* Only q1-q7 are read here, so q8-q10 are left missing */
  /* Use a DO WHILE loop to recode ones to zeros */
  i = 1 ;
  DO WHILE (i LE 10) ;
    IF (question{i} = 1) THEN question{i} = 0 ;
    i = i + 1 ;
  END ;                         /* end the DO WHILE loop */
  DROP i ;                      /* remove the i counter variable from the dataset */
RUN ;
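The same recode can also be written with an iterative DO loop, which avoids managing the counter by hand. The sketch below is self-contained: the dataset name recoded and the data lines are made up, and DIM(question) returns the number of elements in the array.

```sas
/* Hypothetical example: recode ones to zeros with an iterative DO loop */
DATA recoded ;
  INPUT q1-q10 ;                    /* list input for ten numeric variables */
  ARRAY question{10} q1-q10 ;
  DO i = 1 TO DIM(question) ;       /* loop over every array element */
    IF question{i} = 1 THEN question{i} = 0 ;
  END ;
  DROP i ;                          /* do not keep the loop counter */
  CARDS ;
1 2 3 1 2 3 1 2 3 1
2 2 1 1 3 3 2 1 2 3
;
RUN ;

PROC PRINT DATA = recoded ;
RUN ;
```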
Advanced Appendices
Options for Program Editors
SAS for Windows has many ways to select/specify commands. You can use:
- Command line, which is intended for experienced SAS users. Use of the command line requires knowledge of the SAS command set.
- Command Bar, the recommended choice, intended as a navigator and for novice users.
- Command Box, a moveable command dialog box.
- Pop-up menus are always available. In any window, clicking the right mouse button will open "pop-up" menus matching the appropriate menus available for that window.
For more information about the PROGRAM EDITOR window Preferences, consult the SAS Companion for the Microsoft Windows Environment.
Methods for Running SAS
SAS can be run in two modes: interactive mode and production (or batch) mode. In production mode, you prepare a file of SAS commands and execute this file after it is written; nothing happens until you run the prepared commands. Interactive mode means using the SAS Display Manager to type, edit, and run SAS commands. In interactive (window) mode, you can choose either to use the pull-down menus to select your commands or to type them into a program window, which gives three ways of running SAS in all. We will demonstrate all of these methods of entering commands in SAS for Windows or Macintosh.
The preferred method of running SAS on the Unix computers is the non-interactive mode, in which you prepare a file of SAS commands and then invoke SAS to execute this file of commands. Please consult the Unix document, U-025 SAS on the RS/6000 for further details.
Numerical Accuracy and Representation
Computers do not store or work with numbers in the same manner as you and I. All numbers are represented in binary form in a computer, and SAS stores numeric values as floating-point (real) numbers. One practical result of this system is that a value that looks like an integer to us, e.g. 1 or 7 or 88, may be held by the computer as a slightly different real number, e.g. 1.0000000000232123 or 87.9999999999999987, particularly when it is the result of arithmetic on values with fractional parts.
Because of this numerical representation, if you typed: IF (anyvar=1) THEN newvar=20 ; the value of newvar may or may not be 20 for those cases where you expect anyvar to equal 1. A whole number read directly from your data is normally stored exactly, but a computed value may have been stored as 1.0000000000000132 or maybe 0.9999999999999778, neither of which equals 1.0000000000000000.
Thus when you are making numerical comparisons or recodes, make sure that your categories include all possible real numbers. For example:
IF educ2 GT 0 AND educ2 LT 12 THEN educ2 = 1 ;
ELSE IF educ2 GE 12 AND educ2 LT 16 THEN educ2 = 2 ;
ELSE IF educ2 GE 16 THEN educ2 = 3 ;
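When an exact match on a whole number is really what you want, one common safeguard is to round the value before comparing it. The sketch below is illustrative only: the variable x and its arithmetic are made up, and whether the exact test succeeds can depend on the computer system.

```sas
/* Hypothetical check: compare a computed value with 1, exactly and after rounding */
DATA _NULL_ ;
  x = 0.1 * 3 / 0.3 ;       /* mathematically 1, but computed in binary */
  IF x = 1 THEN PUT 'Exact test: x equals 1' ;
  ELSE PUT 'Exact test: x does not equal 1 ' x= BEST32. ;
  IF ROUND(x, 1) = 1 THEN PUT 'Rounded test: x equals 1' ;
RUN ;
```

The messages are written to the SAS log. ROUND(x, 1) rounds to the nearest whole number, so the rounded test should succeed even when the exact test does not.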
Documentation and Help
SAS Online Documentation offers easy access to the most frequently used SAS documentation (previously available only in print), including news about SAS components that are shipped as experimental or beta. SAS for Windows includes pull-down help as well as ASSIST menus and dialogue boxes. You can also use the HELP command from the SAS for Windows command line.
SAS Tutorial
In SAS version 8.2 the online tutorial may be accessed from the Help drop-down menu by selecting the SAS Online Tutor under Books and Training. SAS Institute provides an on-line computer-based training (CBT) tutorial. The SAS/TUTOR module is licensed and available for SAS for Windows and SAS on the RS/6000s. In order to use this program you need to obtain the SAS/TUTOR training notes, which are available for purchase at the University Bookstore's PROFS Publishing. The cost is based on the cost of Profs Publishing photocopying the original notes. (If you have questions about getting a copy of these notes, please e-mail res-consult@virginia.edu ) Once you have these notes, you can invoke the SAS/TUTOR module.
In SAS for Windows version 8, SAS/TUTOR is invoked by starting up SAS and then selecting Online Training from the Help menu.
In SAS on Macintosh, double-click on the SAS/Tutor icon in the SAS folder.
You may also want to look at the use of SAS/ASSIST for creating a command file by using the pull-down menus and selecting the commands needed in their appropriate order.
SAS Manuals
SAS Institute, Inc. publishes a large library of manuals and statistical procedure guides. Some of these are available in the trade books section of the University of Virginia bookstore. All of the manuals listed may be purchased directly from the SAS Institute, Inc., or may be ordered through any bookstore. They are also available, for reference use only, at the ITC Research Computing Support Center (RCSC), Wilson Hall Room 244. We have a comfortable "reading area" at the RCSC where you can browse a manual as well as get assistance. There are manuals in the RCSC that can be checked out for up to 24 hours. Speak with the computing consultant in Room 244 Wilson in order to check out a manual.
Also, all the SAS manuals are available on one CD-ROM that you can get from the RCSC for your use, or you can access all of this SAS "OnLineDoc" at our web site:
http://central.itc.virginia.edu/manuals/sas8/onldoc.html
Sample Syntax
Another aid to understanding SAS may be obtained by looking at sample programs provided by SAS. The programs come complete with data, and may be examined for ideas on how to set up a procedure, or may be run so that the output of the program may be studied.
The location of the SAS Institute, Inc. example files for PC SAS and SAS for Windows depends on the choices made during installation of these products on your module. In general, they are in the SAS subdirectory along with the module to which they pertain. For example, SAS Institute, Inc. sample files for the STAT module are generally located in: /SAS/STAT/SAMPLE, whereas the sample files for the ETS module would be in: /SAS/ETS/SAMPLE.
On the RS/6000s, sample files from SAS Institute, Inc. are in the directory /common/sas82/rs6000/samples, in subdirectories labeled base, stat, graph, af, ets, insight, and or. These files must be copied to your own account before you can run them. Locally written example files may be browsed or copied from our SAS web page at:
http://www.itc.virginia.edu/research/sas/home.html#examples
Web Documentation
The Research Computing Group supports technologically advanced statistical work via our Researchers website.
SAS provides assistance via its own Technical Support website. In addition, in version 8, the Help files included with the program are in HTML (Web) format.
SAS 8 online documents (help): http://central.itc.virginia.edu/manuals/sas8/onldoc.htm
Consulting Services
Additional assistance with SAS command file construction and statistical routines is available from the Statistical Computing Consultant located in the ITC Research Computing Support Center in Wilson Hall Room 244 (243-8800). The consultants can be contacted via electronic mail to res-consult@virginia.edu. Please note that consulting hours vary by semester, as well as holidays.
For statistical consulting (as opposed to statistical computing consulting), you may wish to contact the Statistics Department (http://www.stat.virginia.edu/consulting.html).
There are no charges for the advice of the faculty consultant, but there is a fee for graduate student consultants as well as for the expertise of statistics faculty other than the dedicated consultant. To find the current faculty consultant, contact the division's secretary, Ms. Brenda Crider (103 Halsey Hall, phone: 924-3222).