The apply() family is a set of base R functions that apply an operation across the rows, columns, elements, or groups of a data structure without an explicit loop. These functions are a staple of data analysis in R and are particularly useful when working with scientific datasets.
The apply() function works on matrices and arrays, applying a function across rows or columns.
Syntax: apply(X, MARGIN, FUN, ...)
X: matrix or array
MARGIN: 1 = rows, 2 = columns
FUN: function to apply

# Temperature measurements from 4 weather stations over 5 days
temperature_matrix <- matrix(c(23.1, 25.4, 22.8, 24.2, 26.1,
                               21.3, 23.7, 20.9, 22.5, 24.3,
                               25.6, 27.2, 24.8, 26.1, 28.4,
                               22.7, 24.9, 21.5, 23.3, 25.8),
                             nrow = 4, ncol = 5,
                             byrow = TRUE,  # fill row by row so each row holds one station's five days
                             dimnames = list(c("Station_A", "Station_B", "Station_C", "Station_D"),
                                             c("Day1", "Day2", "Day3", "Day4", "Day5")))
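To make the MARGIN argument concrete, here is a minimal sketch on a small made-up matrix (the name `m` and its values are illustrative and not part of the weather data). It shows that MARGIN = 1 returns one value per row, MARGIN = 2 returns one value per column, and that extra arguments after FUN are passed on through `...`.

```r
# Hypothetical 2 x 3 matrix, filled row by row
m <- matrix(c(1, 2, 3,
              4, 5, NA),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

row_sums       <- apply(m, 1, sum)                 # one value per row; r2 is NA
row_sums_clean <- apply(m, 1, sum, na.rm = TRUE)   # na.rm is passed on to sum()
col_means      <- apply(m, 2, mean, na.rm = TRUE)  # one value per column
```

The result keeps the row or column names of the input, so values can be looked up by name afterwards.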
# Calculate average temperature for each station (across columns)
station_averages <- apply(temperature_matrix, 1, mean)
# Calculate daily averages across all stations (across rows)
daily_averages <- apply(temperature_matrix, 2, mean)
# Find maximum temperature recorded at each station
max_temps <- apply(temperature_matrix, 1, max)

lapply() applies a function to each element of a list or vector and returns a list.
# List of experimental measurements from different trials
trial_data <- list(
  trial_1 = c(12.3, 11.8, 12.7, 11.9, 12.1),
  trial_2 = c(13.1, 12.9, 13.3, 12.8, 13.0),
  trial_3 = c(11.7, 11.2, 11.9, 11.5, 11.8),
  trial_4 = c(12.8, 12.4, 12.9, 12.6, 12.7)
)
# Calculate mean for each trial
trial_means <- lapply(trial_data, mean)
# Calculate standard deviation for each trial
trial_sds <- lapply(trial_data, sd)
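Because lapply() always returns a list, individual results are pulled out with `[[ ]]` or `$`. A short sketch with hypothetical data (`runs` and `run_means` are illustrative names):

```r
# Hypothetical input: two runs of different lengths
runs <- list(run_1 = c(1, 2, 3), run_2 = c(4, 6))
run_means <- lapply(runs, mean)

is.list(run_means)     # the result is a list
names(run_means)       # it carries the same names as the input
run_means[["run_1"]]   # extract one element as a plain number
run_means$run_2        # equivalent access by name
```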
# Apply a custom function to calculate coefficient of variation
cv_function <- function(x) {
  (sd(x) / mean(x)) * 100
}
trial_cv <- lapply(trial_data, cv_function)

sapply() is similar to lapply() but tries to return a simpler data structure (a vector or matrix instead of a list).
# Using the same trial data
# Get means as a named vector instead of a list
trial_means_vector <- sapply(trial_data, mean)
# Calculate multiple statistics at once
trial_stats <- sapply(trial_data, function(x) {
  c(mean = mean(x),
    sd = sd(x),
    min = min(x),
    max = max(x))
})
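When the function returns a named vector of the same length for every element, sapply() binds the results into a matrix with one column per list element. The sketch below uses a small hypothetical list (`samples`) to show the layout and how to index it.

```r
# Hypothetical input: two samples with two values each
samples <- list(a = c(1, 3), b = c(10, 20))
stats <- sapply(samples, function(x) c(mean = mean(x), max = max(x)))

dim(stats)             # 2 statistics (rows) by 2 samples (columns)
stats["mean", "b"]     # index by statistic name, then by sample name
stats_by_sample <- t(stats)  # transpose to get one row per sample
```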
# Working with gene expression data
gene_expression <- list(
  gene_A = c(2.1, 2.3, 1.9, 2.2, 2.0),
  gene_B = c(1.5, 1.7, 1.4, 1.6, 1.5),
  gene_C = c(3.2, 3.1, 3.4, 3.0, 3.3)
)
# Check if any genes are upregulated (mean > 2.0)
upregulated <- sapply(gene_expression, function(x) mean(x) > 2.0)

mapply() applies a function to multiple lists or vectors simultaneously.
# Experimental conditions: temperature and pH levels
temperatures <- c(20, 25, 30, 35)
ph_levels <- c(6.5, 7.0, 7.5, 8.0)
reaction_times <- c(10, 15, 20, 25)
# Calculate reaction efficiency based on multiple parameters
reaction_efficiency <- mapply(function(temp, ph, time) {
  # Simplified efficiency model
  efficiency <- (temp * ph) / time
  return(efficiency)
}, temperatures, ph_levels, reaction_times)
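mapply() walks its vector arguments in parallel, so the i-th result uses the i-th element of each input. A minimal sketch with illustrative names (`scaled_sum`, `a`, `b`, `factor`) also shows `MoreArgs`, which supplies arguments that stay the same for every call:

```r
# Hypothetical helper: add two values, then scale by a constant
scaled_sum <- function(x, y, factor) (x + y) * factor

a <- c(1, 2, 3)
b <- c(10, 20, 30)

# factor = 2 is held constant; x and y vary element by element
result <- mapply(scaled_sum, a, b, MoreArgs = list(factor = 2))
```

For simple arithmetic like this, the vectorized expression `(a + b) * 2` gives the same values; mapply() is most useful when the function itself is not vectorized.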
# Create experimental labels
experiment_labels <- mapply(function(t, p, time) {
  paste("T", t, "_pH", p, "_", time, "min", sep = "")
}, temperatures, ph_levels, reaction_times)

tapply() applies a function to subsets of a vector based on grouping factors.
# Plant growth data with different treatments
plant_heights <- c(15.2, 16.1, 14.8, 15.9, 12.3, 11.8, 12.7, 11.9,
                   18.1, 17.8, 18.5, 17.9, 13.2, 12.9, 13.4, 13.1)
treatment_groups <- factor(rep(c("Control", "Fertilizer_A", "Fertilizer_B", "Drought"),
                               each = 4))
# Calculate mean height for each treatment group
treatment_means <- tapply(plant_heights, treatment_groups, mean)
# Calculate standard error for each group
treatment_se <- tapply(plant_heights, treatment_groups, function(x) {
  sd(x) / sqrt(length(x))
})
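The result of tapply() is a named array with one entry per factor level. Levels created by factor() are sorted by default rather than kept in order of first appearance, which affects the order of the output. A small sketch with hypothetical data (`values` and `groups` are illustrative names):

```r
# Hypothetical measurements: "treated" appears first in the data
values <- c(5, 7, 1, 3)
groups <- factor(c("treated", "treated", "control", "control"))

group_means <- tapply(values, groups, mean)
names(group_means)               # levels appear in sorted order
treated_mean <- group_means[["treated"]]  # look up a group by name
```

Looking results up by name rather than by position avoids surprises from the level ordering.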
# Working with multiple grouping variables
species <- factor(rep(c("Species_1", "Species_2"), each = 8))
habitat <- factor(rep(c("Forest", "Grassland", "Forest", "Grassland"), each = 4))
# Calculate mean heights by both species and habitat
species_habitat_means <- tapply(plant_heights, list(species, habitat), mean)

# Check for outliers in sensor data
sensor_readings <- list(
  sensor_1 = c(23.1, 23.3, 23.2, 45.7, 23.0),  # Contains outlier
  sensor_2 = c(22.8, 22.9, 22.7, 22.8, 22.9),
  sensor_3 = c(23.5, 23.7, 23.4, 23.6, 23.5)
)
# Identify potential outliers using IQR method
outlier_check <- lapply(sensor_readings, function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  iqr <- Q3 - Q1  # lower-case name avoids masking the base function IQR()
  outliers <- x < (Q1 - 1.5 * iqr) | x > (Q3 + 1.5 * iqr)
  return(which(outliers))  # positions of the flagged readings
})

# Log-transform concentration data
concentrations <- list(
  sample_A = c(1.2, 2.1, 1.8, 1.5, 1.9),
  sample_B = c(0.8, 1.1, 0.9, 1.0, 0.7),
  sample_C = c(2.5, 2.8, 2.3, 2.6, 2.7)
)
# Apply log transformation to normalize data
log_concentrations <- lapply(concentrations, log10)
# Simplify to a matrix (5 rows, one column per sample); use unlist() for a flat vector
log_conc_matrix <- sapply(log_concentrations, identity)

The unlist() function is the standard way to convert a list to a vector:
# Simple list of measurements
ph_measurements <- list(6.2, 6.8, 7.1, 6.9, 7.3)
# Convert to vector
ph_vector <- unlist(ph_measurements)
# For lists with multiple elements per component:
temp_data <- list(
  sensor_A = c(23.1, 23.3, 23.2),
  sensor_B = c(22.8, 22.9, 22.7),
  sensor_C = c(23.5, 23.7, 23.4)
)
# Flatten all values into one vector
all_temps <- unlist(temp_data)

# Combine list elements
combined_data <- do.call(c, temp_data)

# Keep names
# (`conditions` is not defined in this section; it is assumed to be a named list such as temp_data)
with_names <- unlist(conditions)
# Remove names
without_names <- unlist(conditions, use.names = FALSE)
# Or
no_names <- as.vector(unlist(conditions))

The apply() family removes the need for explicit loops in many scientific data analysis tasks, which usually makes code shorter and more readable, though not necessarily faster than a well-written loop. Choose the function based on your data structure and desired output format.
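As a quick way to compare output formats, the sketch below runs the same calculation through lapply() and sapply() and inspects what comes back. It also includes vapply(), a stricter base R relative of sapply() that is not covered above, as an optional alternative that checks the declared return type. The list `obs` is hypothetical.

```r
# Hypothetical input: two named groups of observations
obs <- list(a = c(1, 2, 3), b = c(4, 5, 6))

as_list   <- lapply(obs, mean)              # list of length 2
as_vector <- sapply(obs, mean)              # named numeric vector
checked   <- vapply(obs, mean, numeric(1))  # same values as sapply() here, but raises
                                            # an error if a result is not a single number
```

In scripts where the shape of the result matters downstream, the explicit type check in vapply() can catch unexpected input earlier than sapply() would.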