Data frames are generic data objects in R that are used to store tabular data. A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values, one from each column. Data frames are central to using R for any kind of data analysis. One of the most frustrating aspects of R for new users is that, unlike Excel, or even SPSS or Stata, it is not terribly easy to look at and modify data in a spreadsheet-like format.
The following are the characteristics of a data frame.
- The column names should be non-empty.
- The row names should be unique.
- The data stored in a data frame can be of numeric, factor, or character type, among others (for example, logical).
- Each column should contain the same number of data items.
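The characteristics listed above can be checked on a small example. The following is a minimal sketch; the data frame `df` and its column names are hypothetical and chosen only for illustration.

```r
# A small hypothetical data frame with numeric, character, and factor columns
df <- data.frame(
  id    = 1:3,
  name  = c("Ann", "Bob", "Cy"),
  group = factor(c("a", "b", "a")),
  stringsAsFactors = FALSE
)

str(df)             # shows the type of each column
names(df)           # column names, all non-empty
rownames(df)        # row names, unique by default
sapply(df, length)  # every column holds the same number of items

# Columns whose lengths differ and cannot be recycled raise an error
res <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error"
)
res
```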
edit and fix
R provides two ways to edit an R data frame (or matrix) in a spreadsheet-like fashion. They look the same, but are different! Both can be used to look at data in a spreadsheet-like way, but editing with them produces dramatically different results.
The first of these is edit, which opens an R data frame as a spreadsheet. The data can then be edited directly. When the spreadsheet window is closed, the resulting data frame is returned to the user. If the result is not assigned, it is simply printed to the console, which is a reminder that edit did not actually change the object. In other words, when we edit a data frame, say mydf, we are actually copying the data frame, changing the copy's values, and then returning it to the console. The original mydf is unchanged. If we want to use the modified data frame, we need to save it as a new R object.
The second data editing function is fix. This is probably the more intuitive function. Like edit, fix opens the spreadsheet editor. But when the window is closed, the result is used to replace the original data frame. Thus, fix(mydf) replaces mydf with the edited data.
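The difference between the two functions can be sketched as follows. The contents of `mydf` and the name `mydf_edited` are assumptions made for illustration, and the editor calls are wrapped in `interactive()` because both functions need an interactive session to open the spreadsheet window.

```r
# A hypothetical data frame to edit
mydf <- data.frame(x = 1:3, y = c("a", "b", "c"))

if (interactive()) {
  # edit() returns a modified copy; mydf itself is left untouched,
  # so the result has to be assigned to keep the changes
  mydf_edited <- edit(mydf)

  # fix() opens the same editor but overwrites mydf with the result
  fix(mydf)
}
```

Assigning the result of edit keeps both the original and the changed version available, whereas fix leaves only the changed one.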
Using edit and fix can seem like a good idea, and if they are used simply to look at data, they’re a great additional tool (along with summary, str, head, tail, and indexing).
Creating Data Frames
Data frames are usually created by reading in a dataset using read.table() or read.csv(). However, data frames can also be created explicitly with the data.frame() function, or they can be coerced from other types of objects, such as lists.
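The three routes described above might look like this. This is a minimal sketch: the object names and values are hypothetical, and the `text` argument of read.csv() stands in for a file path so that the example runs without an external file.

```r
# Explicit construction with data.frame()
df1 <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Coercion from a list of equal-length vectors
lst <- list(id = 1:3, label = c("a", "b", "c"))
df2 <- as.data.frame(lst)

# Reading delimited text; in practice this would be read.csv("some_file.csv")
csv_text <- "id,score\n1,90.5\n2,82\n3,77.25"
df3 <- read.csv(text = csv_text)

str(df1)
str(df2)
str(df3)
```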
Adding on to Data Frames
We can use the cbind() function to add columns to a data frame. One of the objects being combined must already be a data frame; otherwise, cbind() could produce a matrix.
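A short sketch of both outcomes, using hypothetical names (`df`, `passed`, `df_more`, `m`):

```r
df <- data.frame(id = 1:3, score = c(90, 82, 77))
passed <- df$score >= 80

# One of the arguments is a data frame, so the result is a data frame
df_more <- cbind(df, passed = passed)
class(df_more)

# Two plain vectors are combined into a matrix instead
m <- cbind(1:3, c(90, 82, 77))
class(m)
```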
Adding Attributes to Data Frames
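Similar to matrices, data frames have dimensions, which dim() reports as the number of rows and columns (they are computed from the row names and columns rather than stored as a dim attribute). In addition, data frames can have attributes such as row names, column names, and comments.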
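The dimensions, row names, column names, and a comment can be inspected and set as in the sketch below. The data and the comment text are hypothetical.

```r
df <- data.frame(id = 1:3, score = c(90, 82, 77))

dim(df)                              # number of rows, then number of columns
rownames(df) <- c("r1", "r2", "r3")  # replace the default row names
colnames(df)                         # column names
comment(df) <- "hypothetical test scores"

# Lists everything attached to the object
attributes(df)
```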
Subsetting Data Frames
Data frames possess the characteristics of both lists and matrices; if we subset with a single vector, they behave like lists and will return the selected columns with all rows; if we subset with two vectors, they behave like matrices and can be subset by row and column.
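Both subsetting styles are shown in the following sketch, which assumes a small hypothetical data frame:

```r
df <- data.frame(
  id    = 1:4,
  score = c(90, 82, 77, 65),
  grp   = c("a", "b", "a", "b")
)

# A single vector: list-style subsetting, selected columns with all rows
cols <- df[c("id", "score")]

# Two vectors: matrix-style subsetting by row and by column
sub <- df[2:3, c("id", "grp")]

cols
sub
```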
When a data frame is printed, the top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.
To retrieve data in a cell, we can enter its row and column coordinates in the single square bracket “[]” operator. The two coordinates are separated by a comma. In other words, the coordinates begin with the row position, followed by a comma, and end with the column position. The order is important.
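As a brief illustration, the sketch below builds a hypothetical data frame with named rows and retrieves single cells by position and by name.

```r
df <- data.frame(
  score = c(90, 82, 77),
  grp   = c("a", "b", "a"),
  row.names = c("ann", "bob", "cy")
)

df                   # printed with a header line and one line per named row

df[2, 1]             # row 2, column 1: 82
df["bob", "score"]   # the same cell, addressed by row and column name
df[1, 2]             # row 1, column 2; swapping the order picks another cell
```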
Operations that can be performed on a data frame are:
- Creating a data frame
- Accessing rows and columns
- Selecting a subset of the data frame
- Editing data frames
- Adding extra rows and columns to the data frame
- Adding new variables to the data frame based on existing ones
- Deleting rows and columns in a data frame
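Several of the operations in this list can be sketched with base R alone. The data frame and the column name `pct` are hypothetical, and this is only one of several ways to perform each step.

```r
df <- data.frame(id = 1:3, score = c(90, 82, 77))

# Add an extra row by binding another data frame with the same columns
df <- rbind(df, data.frame(id = 4L, score = 65))

# Add a new variable based on an existing one
df$pct <- df$score / 100

# Delete a column by assigning NULL to it
df$id <- NULL

# Delete a row with a negative index
df <- df[-1, ]

df
```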