23 Principles of functional programming
Summary
The tidyverse is a collection of packages in R.
The tidyverse is an example of an API, which is a collection of functions designed to connect to other hardware or software. In the case of the tidyverse, the functions make it easier to access the functionality of base R.
One of the main design principles of the tidyverse is to embrace functional programming.
Functional programming includes making programming functions like mathematical functions, immutability of variables, and the ability to pass functions to other functions.
The tidyverse also encourages using functions like words to build sentences. That way each function need only accomplish a narrow task, and the overall system retains flexibility.
23.1 Application programming interface
An API (application programming interface) is a collection of functions and tools that allow the creation of applications that access other base functionality.
For instance, there are APIs for
accessing the operating system,
accessing a graphics card,
accessing a hard drive,
and, in the case of the tidyverse,
accessing the base functions of R.
The purpose of an API is to make life easier for the programmer who must create the code, the updater who must maintain the code, and the end user who receives the output of the code.
In the case of the tidyverse, Hadley Wickham had four principles in mind when creating the packages that comprise the system.
It should reuse existing data structures.
It should compose simple functions with pipes to make more complicated constructs.
The API should embrace functional programming.
It should be designed with humans in mind.
Let’s look at each of these principles in turn.
23.2 Reusing existing structures
Data has been collected for millennia. While a census count of France in the 17th century might only be of interest to historians, data from ten or even a hundred years ago is often still of great importance today.
Therefore it is important that any tools work with systems for organizing information that already exist. In the context of R, that means that any new tools should work within the context of the data frame, which is the primary data type for storing data in R.
That is why the tibble, the preferred data storage form in the tidyverse, extends the data frame rather than replacing it. Any package in the tidyverse can take as an argument either a tibble or a data frame.
23.3 Pipes make code easier
No one can hold a complex series of transformations entirely in their head. Pipes give us a semantic way of breaking down such a series into its component parts. That way we can handle large tasks one step at a time.
So what does this mean when we begin writing functions of our own? There are a couple of things to keep in mind:
Keep functions simple. That means each function should have as few inputs as possible and return only one thing. That makes it easy to chain functions together using pipes.
Function names should be verbs when possible. That makes the piped code easier to read. The function filter filters out observations, select selects variables, and so on.
Of course, these are guidelines, not hard and fast rules. Most of the geom_ functions in ggplot2, for instance, are nouns rather than verbs, because they add a particular thing to the canvas.
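As a rough illustration of these guidelines, the sketch below chains three small, verb-named functions with R's native pipe `|>` (this assumes R 4.1 or later; the tidyverse pipe `%>%` reads the same way). The helper names `keep_positive`, `square_each`, and `add_up` are made up for this example and are not tidyverse functions.

```r
# Hypothetical single-purpose helpers: one input, one returned value each.
keep_positive <- function(x) x[x > 0]   # drop non-positive entries
square_each   <- function(x) x^2        # square every entry
add_up        <- function(x) sum(x)     # collapse to a single number

# Read left to right: keep the positives, square them, add them up.
result <- c(-2, 3, 0, 4) |>
  keep_positive() |>
  square_each() |>
  add_up()

print(result)
```

Because each helper does one narrow job, the pipeline reads like a sentence and any step can be swapped out without touching the others.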
23.4 Use functional programming
This is a big one, and so will take some explanation. There are several types of programming paradigms. Three of the most common are as follows.
Imperative programming. Here the focus of the programmer is on how to modify the state of the system in order to accomplish a task. Most commands are destructive: they remove an existing portion of the state and replace it with a new one. The Turing machine is the canonical example of imperative programming.
Functional programming. Here the focus is on listing the transformations needed to get from the current state to the final state. Commands are non-destructive: they indicate what to do to the existing state to move it in the desired direction. The Lambda Calculus is the canonical example of functional programming.
Event programming. Here functions are triggered by outside events. This type of programming is useful for game design or making user interfaces for data analysis.
Note that none of these paradigms is “right” or “better”. Instead, they have different strengths and weaknesses that encourage the user to think about their problem and write code to solve it in different ways.
Most procedural and object-oriented languages are imperative. On the other hand, since a statistic is a function of the data, many statistical analyses have a clearer form when written as a functional program.
So what makes a language functional? There are several properties that a functional language must have.
Functions are mathematical functions, also known as pure functions.
Variables are immutable, meaning they cannot be changed once assigned. This leads to referential transparency, where each variable name refers to a single value.
Recursion is used instead of loops.
Functions are First-Class and can also be Higher-Order. This means you can pass functions as input to other functions.
23.4.1 Functions are mathematical functions
In imperative programming, a function can either be like a mathematical function (for example: \(y = x^2\)) or it can be a set of commands that alters the state of the system in a destructive fashion.
In a programming language, a function is pure if it always produces the same output with the same input, and if there are no side-effects. No side-effects means that the function does not change the value of the input variables or any global state.
In functional programming, functions are all just mathematical functions. The following code in R incorporates the function \(y(x) = x^2\). It returns one thing, the output of the function.
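The listing itself is not reproduced in this copy. A minimal sketch of such a function, with the name `square` chosen here purely for illustration, could look like this:

```r
# A pure function: the output depends only on the input x,
# and nothing outside the function is modified.
square <- function(x) {
  x^2
}

square(3)
```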
Note that R is designed to encourage this type of programming by only allowing the return of a single object.
23.5 Variables are immutable
In a functional program, it is not permitted to change the value of a variable! Once you have assigned a variable, you cannot change its value.
Say that variables in a programming language are immutable if they can only be assigned once.
This is again to bring variables in line with how variable names are used in mathematics. For instance, if I write \[\begin{align*} y &= x^2 \\ y &= -|x| - 2, \end{align*}\] that does not make any sense, as \(y\) cannot be both of those things simultaneously.
Of course, R does not enforce variable immutability. It happily allows you to change the values of variables within a function or use functions to change the variable values.
So if you are going to use this principle, you will have to enforce it yourself. That means writing code that assigns each new result to a new name instead of overwriting an existing one.
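The two listings being contrasted are not reproduced in this copy. As a hedged sketch of the idea (the variable names and values are invented for illustration), the first style gives each intermediate result its own name, while the second keeps overwriting a single name:

```r
# Immutable style: every name is assigned exactly once.
heights_cm    <- c(150, 160, 170)
heights_m     <- heights_cm / 100
mean_height_m <- mean(heights_m)

# Mutable style: the same name x means three different things in turn.
x <- c(150, 160, 170)
x <- x / 100
x <- mean(x)
```

Both versions compute the same number, but only in the first does every name still refer to its original value at the end.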
So why make variables immutable? It leads to a great advantage of functional programming called referential transparency.
A programming language has referential transparency if every assigned variable has the same value throughout the program.
In other words, a particular name only references one value of the variable. That means that when writing code, you never have to worry about the same variable being used in two different ways, or the same function name being used for two different functions. You are guaranteed that the result will always stay the same.
This prevents you from accidentally changing the value of a variable and then expecting it to be the same as it was before. Or if you are collaborating in writing code on a large project, it prevents you from changing a variable in one part of the code that you are working on, thereby breaking code that your collaborator had finished.
23.6 Recursion is used instead of loops
But wait a minute, one of the most common constructions in programming languages is the loop, which executes a series of commands more than once. For instance, consider the following snippet of C code:
#include <stdio.h>

int main () {
    int a, s = 0;
    /* for loop execution */
    for( a = 10; a < 20; a = a + 1 ){
        s = s + a;
    }
    printf("%d\n",s);
    return 0;
}

I don’t want to get too much into the details of this code, but I will say that this code calculates \(\sum_{a = 10}^{19} a = 145\). It does this by keeping track of the sum at each stage of the computation, and changing the variable at each step. So what can go wrong? Well, suppose that I had a bug in my code:
#include <stdio.h>

int main () {
    int a, s = 0;
    /* for loop execution */
    for( a = 10; a < 20; a = a + 1 ){
        s = s + a;
        a = a - 1;  /* bug: cancels the increment in the loop header */
    }
    printf("%d\n",s);
    return 0;
}

Inside the for loop, the value of a is reduced by one at each step, which undoes the addition of 1 to a in the loop header. This code will never stop; it will run forever!
That’s bad! The good news is that functional languages cannot change the values of variables once assigned, so this type of bug cannot arise in a functional language.
But of course that raises the question of how exactly a functional language runs a loop. The answer is that it uses recursion to solve this type of problem.
A function is defined recursively if it includes itself in the definition.
Let’s see how we could build that same for loop using recursion. To do this, let’s make the function a bit more general. Say that \[ s(n) = \sum_{a = 10}^n a. \] Then mathematically we can define \(s(n)\) recursively as follows: \[\begin{align*} s(10) &= 10 \\ s(n) &= n + s(n - 1) & & \text{when } n > 10. \end{align*}\]
Note that, as in a proof by induction, we have a base case \(s(10) = 10\) and a recursive case \(s(n) = n + s(n - 1)\). From this description it is possible to directly build recursive code.
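A minimal sketch of that translation in R follows. The function name `s` mirrors the formula, and the guard against `n < 10` is an addition that is not part of the mathematical definition:

```r
# Recursive sum of the integers from 10 up to n, following the
# definition s(10) = 10 and s(n) = n + s(n - 1) for n > 10.
s <- function(n) {
  stopifnot(n >= 10)      # added guard: the definition only covers n >= 10
  if (n == 10) {
    10                    # base case
  } else {
    n + s(n - 1)          # recursive case
  }
}

s(19)
```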
Note that the recursive definition never requires redefining a variable! Now, that being said, R does have a for loop, mainly because recursion is much slower in practice. However, R's for loop is partially functional, in the sense that a change made to the loop variable inside one execution of the loop body stays in that execution; the next iteration simply takes the next element of the sequence. That means that it is not possible to create the bug that we saw in C, where altering the for loop variable resulted in an infinite loop.
Consider the following R code:
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
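The listing that produced the output above is not reproduced in this copy. A loop consistent with that output, offered as an assumption rather than as the original code, decrements the loop variable inside the body:

```r
# The loop variable i is overwritten inside the body, but the next
# iteration still takes the next element of 1:5.
for (i in 1:5) {
  i <- i - 1
  print(i)
}
```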
Because the for loop iterates over a fixed set of values rather than repeatedly applying an update operation to a counter, it is just not possible to recreate in R the bug that the C code had.
23.7 Functions are first-class and higher order
In R, the assignment operator <- is used to assign a function to a name, just as it is used to assign any other value. So functions and variables are really the same type of object. This is what we mean by a function being first-class.
In a programming language, a function is first-class if it is treated like any other variable.
Since it is treated like any other variable, we can take functions as input to another function, and return functions as results. We call functions like these higher-order.
A function which takes a function as input, or returns a function as output, is called higher-order.
This is especially useful with the ggplot function, which takes a parameter mapping that is set using a call to the function aes, which has its own parameters.
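A small base-R sketch of both directions follows. The names `apply_twice` and `make_power` are hypothetical helpers written for this example; `sapply` is a standard base R function that takes a function as an argument.

```r
# A function is stored under a name like any other value.
square <- function(x) x^2

# Higher-order: takes a function f as input and applies it twice.
apply_twice <- function(f, x) f(f(x))

# Higher-order: returns a newly built function as output.
make_power <- function(p) {
  function(x) x^p
}
cube <- make_power(3)

apply_twice(square, 3)   # square(square(3))
cube(2)                  # 2^3
sapply(1:3, square)      # sapply receives square as an argument
```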
23.8 Functional programming and data science
That is functional programming in a nutshell. So how does that relate to R and data science?
R itself is not a fully functional language, but it incorporates enough features of functional languages that it is possible to do functional programming. Sticking to this paradigm is very helpful both for code readability and in large collaborative projects.
Functional programming fits in very nicely with the data science view that we are transforming our data to make patterns obvious. Some languages used in data science, such as Haskell, are fully functional languages for this reason.
23.9 Designing the API for humans
The last principle for the design of the tidyverse is that it will be used by humans. Previous chapters have not talked much about computational complexity. That is partially because that would lead us deeper into the algorithm for accomplishing tasks than we plan to go here, but also because in practice most of the difficulty of data analysis comes from the human time, not the computer time.
Therefore, it is essential that you make your analysis as transparent as possible to humans, sometimes even at the cost of making the code slower.
This also informs the choice of function names. For instance, the geometry functions all begin with geom_. This makes them easier for people to remember, and also has the added benefit of making the autocomplete more powerful, as a user can scroll through a set of possibilities in order to decide what is appropriate.
In naming your functions, do not be afraid to use a lengthy name if the descriptive power of the name is needed. Save short names for functions that will be used very often, and then overall your code will be much easier for others to read and use.