What is a Dataframe?
The idea in one sentence
A data frame is a table where each row is one observation and each column is one variable, designed so you can ask and answer questions with code.
Why data frames matter
Most analysis tasks become easier when your data is in a consistent table:
- You can filter (keep certain rows)
- You can select (keep certain columns)
- You can summarize (compute counts, averages, percentages)
- You can group (compare subgroups fairly)
In other words: clean data frames turn “I wonder…” into “I can test that.”
A tiny example
Below is a simple data frame that looks like exam/course data.
library(dplyr)
df <- tibble::tibble( student_id = c(101, 102, 103, 104, 105, 106), term = c("Fall", "Fall", "Fall", "Spring", "Spring", "Spring"), course = c("MSK I", "MSK I", "Neuro I", "MSK I", "Neuro I", "Neuro I"), final_grade = c(92, 68, 85, 74, 88, 61) )
dfHow to think about the structure
Rows: “What is one row?”
In this example, one row is one student in one course in one term.
That choice matters: it determines what questions you can answer cleanly.
Columns: “What is one column?”
Each column should represent one concept:
student_ididentifies whotermdescribes whencoursedescribes whatfinal_gradeis the outcome we care about
Questions you can answer (and the code patterns)
1) How many rows are in the dataset?
nrow(df)2) What are the column names?
names(df)3) What is the average final grade overall?
df %>%
summarize(avg_grade = mean(final_grade))4) What is the average grade by course?
df %>%
group_by(course) %>%
summarize(
n = n(),
avg_grade = mean(final_grade)
)5) Which students are below 70?
df %>%
filter(final_grade < 70)A simple “data frame checklist”
Before analyzing, ask:
Is one row one thing? (one student-course-term? one exam attempt? one clinic visit?)
Are columns variables, not sentences?
Are values consistent? (e.g.,
FallvsfallvsFA)Do you have an ID? (student, course, exam, location, etc.)
Are missing values handled intentionally?
Takeaway
A good data frame is less about “formatting” and more about making your questions computable.
If you choose the right row definition and consistent columns, your analysis becomes a set of repeatable steps.