Contents
Many programs that we write can be embedded in recurrent neural networks (RNNs). For such programs, a trained RNN can perform better than if we write the algorithm by hand, refining it via trial and error. I walk through an example in detail.
Introduction
Humans are bad at managing spaghetti code. Of course, we should try and avoid writing spaghetti code if we can. But there are problems that are so illspecified that any serious attempt to solve them results in just that.
Let me make this concrete. In research projects, we often write programs that extract information from raw data. The data may have idiosyncrasies and follow no clear specification. Some examples:
 Identify mentions of corporations and their officers in news articles.
 Label public procurement contracts by what kind of service the firms supply.
 Determine if a message exchanged between engineers contains program code.
If we want the output to be perfect—at least in the limit—then we need to exhaustively examine each observation. We can go one step further to reassure ourselves that the program produces the correct output by picking representative examples and writing unit tests for them. Almost every programming language has libraries for this, including R and Python.
But this approach won’t work well if getting the correct output requires that we specify complicated decision rules. In the examples above,
 corporate officers might be mentioned in the news by nickname,
 contracts might describe the procured services using synonyms, and
 it might not always be trivial to tell apart program code from English.
In such situations, we might be better off training a neural network. Algorithms that train neural networks thrive on spaghetti.
Detecting program code in messages
In this post, the problem that I’ll walk through solving is this: how can we detect if a message sent during code review explicitly refers to program code? Let’s suppose that the code base that we observe is written in C, and that the following examples are representative of the messages that engineers send:
 LGTM with render_ipa_alloc().
 If the FTPSACK flag is set, then use a prespecified value.
 AFAICT there is nothing else to check (unless you can think of something).
 Actually, debug_error() doesn’t return NULL, so we should use IS_ERROR() here.
 This fails to build on aarch64 even though it works without issue on amd64.
 I’ve added if (err) goto cleanup; but the code still leaks.
The highlighted expressions are programcode references that we would like to detect.
Ideas for decision rules
We take a straightforward approach and look for decision rules that can tell apart program code from ordinary English. Here are some simple ideas:
#  Rule  True positive  False positive  False negative 

1  Word followed by parentheses is code.  A and D  –  B and F 
2  Allcaps word is code.  B and D  C  A and F 
3  NonEnglish word is code.  A, B, and D  C, E, and F  – 
None of these rules is perfect: each has false positives or false negatives or both.

Rule 1 is easy to implement but it misses obvious positive cases like
if (err) goto cleanup;
.  Rule 2 mistakenly classifies capitalized acronyms as program code. To make this rule effective, we need to complement it with an extensive list of acronyms and abbreviations that we might encounter (e.g., AFAICT, LGTM, USD, COVID).
 Rule 3 mistakenly classifies engineering jargon as program code. To make this rule work, we need an even longer word list than for Rule 2. We can start with a canonical list of English words, but we need to complement it both with acronyms and with words that commonly appear in software engineering, like aarch64 and amd64.
A handwritten algorithm
Nevertheless, we reason that a simple algorithm might do well enough. So we set out to write a program that implements Rule 1. Our program will decide in two steps whether a message contains code:
 Preprocessing: convert the message into a sequence of tokens. The tokens are chosen so that they capture syntactic elements of C program code. This is the information that we need to apply Rule 1.
 Inference: apply a function to the token sequence to check if the sequence satisfies the rule. If it does, then we conclude that the message contains valid C code.
Let’s take the first example sentence from before. We could encode it as a sequence of eight tokens:^{1}
Rule 1 says that the token sequence underscore_identifier
–open_paren
–close_paren
indicates the presence of program code in the sentence.
Let’s write a classifier in Python that detects a sequence like this.
First, we set up the types.
For simplicity, we model tokens as strings.
In order to keep track of the tokens that we have seen in the sequence, we also define a data class called State
:
1
2
3 = # E.g., "all_caps_identifier".
4
5
6
7 : = False
8 : = False
9 : = False
10 : = False
To classify a message, we write a function called contains_code
that returns True
if the message satisfies Rule 1.
The function decides this by iterating over the token sequence and checking at each element whether the rule applies.
Since applying the rule means checking three consecutive tokens, not just one token, we need to keep a memory of what tokens we have seen earlier in the sequence.
The function contains_code
keeps this memory inside an instance of State
:
12
13
14
15 =
16
17
18 =
19
20 return
Now we are ready to implement process
.
This is where we check if the rule applies and maintain the state:
22
23 """Process one element of the token sequence, taking into account
24 what we have seen earlier in the sequence."""
25
26
27 return
28
29 # Rule 1: an identifier followed by opening and closing parens is
30 # program code.
31 #
32 33 34
35 = True
36 return
37
38 # Housekeeping: take note if we see identifiers and opening parens
39 # in case the next token will be a closing paren.
40 #
41
42 = 43 44
45
46 = in 47 48 49 50
51 = ==
52
53 return
And with that, we have a classifier that implements Rule 1. Let’s see how well it performs by looking at its precision and recall. The good news:
 The code is simple.
 It has no false positives.
 It has a precision of 100 percent on our (completely made up) examples.
The bad news is that this classifier has a high false negative rate which leads to a recall of only 50 percent.
If we want to increase recall and add Rule 2 to the algorithm, then the code becomes more complex: we need to add more fields to the State
class, and more if
/elif
/else
statements and related housekeeping to process
.
If we want to further refine any of the rules, then we need to add yet more complexity.
We see that tweaking the algorithm by hand to find improvements can get out of control. It can easily result in spaghetti code that will be a struggle to maintain. How can we do better?
Neural networks to the rescue
Mired in our predicament, we reflect on the fact that contains_code
and process
are a state machine.
This fact is notable because state machines can be encoded by recurrent neural networks (RNNs).^{2}
So rather than tweaking the algorithm by hand, we could find a better algorithm by training an RNN on our examples.
The general idea
At a high level, an RNN is an approximation of the conditional probability
\[\Pr({\rm MessageContainsCode} = 1 \mid {\rm Token}_1 = x_1, \ldots, {\rm Token}_T = x_T).\]
This approximation is calculated by processing the token sequence element by element.
For each token, we calculate a vector that does what State
did in our Python code.
This is usually called a hidden state because it is only an interim variable and appears in neither the input nor the output of the model.
We get it by taking into account the previous state:
\[\begin{aligned}
{\rm State}_0 &:= 0 \in \mathbb{R}^S \\
{\rm State}_1 &:= f({\rm Token}_1, {\rm State}_0) \\
{\rm State}_2 &:= f({\rm Token}_2, {\rm State}_1) \\
&\;\;\vdots \\
{\rm State}_T &:= f({\rm Token}_T, {\rm State}_{T1}) \\
\end{aligned}\]
where the function \(f\) represents the hidden layers of the recurrent network.
The message is classified based on the final state:
\[\widehat{\Pr}({\rm MessageContainsCode} = 1 \mid {\rm Token}_1, \ldots) := g({\rm State}_T)\]
where \(g\) is what is called the output layer of the network.
How should we parameterize the functions \(f\) and \(g\)?
Our Python code as math
Let’s think about what a network that encodes Rule 1 looks like. Returning to the previous example sentence, let’s simplify the math that ensues by using fewer types of tokens:
Each token in the sentence is represented as a binary vector, so the input data that we get looks like this (with the zeros erased to make the table easier to read):
\(x_1\)  \(x_2\)  \(x_3\)  \(x_4\)  \(x_5\)  \(x_6\)  \(x_7\)  \(x_8\)  

identifier 
\(1\)  \(1\)  \(1\)  
open_paren 
\(1\)  
close_paren 
\(1\)  
unknown 
\(1\)  \(1\)  \(1\) 
To model the state, we need to add three hidden layers to the network. When we are processing the \(t\)th token in the sequence, we calculate each of the three layers, one after the other. Let’s denote the first layer by \(h_t^1\), the second layer by \(h_t^2\), and the third by \(h_t^3\). The first two layers hold intermediate calculations while the third layer holds our final state after processing the token.
The third layer, \(h_t^3\), corresponds to \({\rm State}_t\) in the schematic presentation above, and the initial state, \(h_0^3\), corresponds to \({\rm State}_0\). This table shows how the hidden state evolves:
\(h_0^3\)  \(h_1^3\)  \(h_2^3\)  \(h_3^3\)  \(h_4^3\)  \(h_5^3\)  \(h_6^3\)  \(h_7^3\)  \(h_8^3\)  

seen_code 
\(1\)  \(1\)  
identifier 
\(1\)  \(1\)  \(1\)  
open_paren 
\(1\)  
p_identifier 
\(1\)  \(1\)  \(1\) 
To encode the handwritten algorithm, we use the binary indicator function, \(\mathbb{1}\{\cdot > 0\}\), which evaluates to 1 if its input is positive and to 0 otherwise.^{3}
Of course, this would be an outstandingly bad choice if we wanted to train an RNN because its derivative is zero almost everywhere which breaks gradient descent.
But for now, we merely want to specify an RNN that mimics the handwritten algorithm.
The State
class that we used is effectively a binary vector, and the binary indicator function ensures that the RNN’s hidden layers will be binary as well.
The key calculation is done in the second and third hidden layers. The second layer checks if the token sequence \((x_{t2}, x_{t1}, x_t)\) satisfies Rule 1. We don’t do this by referencing the three tokens directly. Instead, we store information about them in the first hidden layer, and we reference that:
\[\begin{aligned} h_{t,\tt this\_is\_code}^2 &:= \mathbb{1}\left\{ 2 h_{t,\tt pp\_identifier}^1 h_{t,\tt p\_open\_paren}^1 h_{t,\tt close\_paren}^1  1 > 0 \right\} \\ \end{aligned}\]
The third layer combines this check, \(h_{t,\tt this\_is\_code}^2\), with a memory of whether any earlier part of the sequence has satisfied the rule:
\[\begin{aligned} h_{t,\tt seen\_code}^3 &:= \mathbb{1}\left\{ 2 \left( h_{t, \tt this\_is\_code}^2 + h_{t, \tt p\_seen\_code}^2 \right)  1 > 0 \right\} \\ \end{aligned}\]
Once we have processed all \(T\) tokens, the output is calculated using the final hidden layer:
\[y_T := \mathbb{1}\left\{ 2 h_{T,\tt seen\_code}^3  1 > 0 \right\}\]
The formula for \(h_{t,\tt this\_is\_code}^2\) looks simple enough but it contains a multiplication between \(h_{t,\tt pp\_identifier}^1\), \(h_{t,\tt p\_open\_paren}^1\), and \(h_{t,\tt close\_paren}^1\). If we wanted to include this multiplicative term, we would need a higherorder RNN, similar to the secondorder RNN that Giles et al. (1992) used to discover state machines. Fortunately, with binary hidden layers, we can stay within the framework of a firstorder RNN by replacing the multiplication with a simple sum:
\[\begin{aligned} h_{t,\tt this\_is\_code}^2 &:= \mathbb{1}\left\{ h_{t,\tt pp\_identifier}^1 + h_{t,\tt p\_open\_paren}^1 + h_{t,\tt close\_paren}^1  2 > 0 \right\} \\ \end{aligned}\]
(And when we will change the activation function in the next section so that we can train the network, we will see that a sum still works.)
Now to wrap things up, we also get some auxiliary calculations done that are needed for \(h_{t,\tt this\_is\_code}^2\) and \(h_{t,\tt seen\_code}^3\). In the first hidden layer, we take note of the token that we are processing and copy the previous state:
\[\begin{aligned} h_{t,\tt identifier}^1 &:= \mathbb{1}\{ 2 x_{t,\tt identifier}  1 > 0 \} \\ h_{t,\tt open\_paren}^1 &:= \mathbb{1}\{ 2 x_{t,\tt open\_paren}  1 > 0 \} \\ h_{t,\tt close\_paren}^1 &:= \mathbb{1}\{ 2 x_{t,\tt close\_paren}  1 > 0 \} \\ h_{t,\tt p\_seen\_code}^1 &:= \mathbb{1}\{ 2 h_{t1,\tt seen\_code}^3  1 > 0 \} \\ h_{t,\tt p\_identifier}^1 &:= \mathbb{1}\{ 2 h_{t1,\tt identifier}^3  1 > 0 \} \\ h_{t,\tt p\_open\_paren}^1 &:= \mathbb{1}\{ 2 h_{t1,\tt open\_paren}^3  1 > 0 \} \\ h_{t,\tt pp\_identifier}^1 &:= \mathbb{1}\{ 2 h_{t1,\tt p\_identifier}^3  1 > 0 \} \\ \end{aligned}\]
In the second layer, we copy some values that will be used in the third:
\[\begin{aligned} h_{t,\tt identifier}^2 &:= \mathbb{1}\left\{ 2 h_{t,\tt identifier}^1  1 > 0 \right\} \\ h_{t,\tt open\_paren}^2 &:= \mathbb{1}\left\{ 2 h_{t,\tt open\_paren}^1  1 > 0 \right\} \\ h_{t,\tt p\_seen\_code}^2 &:= \mathbb{1}\left\{ 2 h_{t,\tt p\_seen\_code}^1  1 > 0 \right\} \\ h_{t,\tt p\_identifier}^2 &:= \mathbb{1}\left\{ 2 h_{t,\tt p\_identifier}^1  1 > 0 \right\} \\ \end{aligned}\]
And lastly, in the third layer, we maintain a memory of tokens that we can use later when we process the next token, \(x_{t+1}\):
\[\begin{aligned} h_{t,\tt identifier}^3 &:= \mathbb{1}\left\{ 2 h_{t,\tt identifier}^2  1 > 0 \right\} \\ h_{t,\tt open\_paren}^3 &:= \mathbb{1}\left\{ 2 h_{t,\tt open\_paren}^2  1 > 0 \right\} \\ h_{t,\tt p\_identifier}^3 &:= \mathbb{1}\left\{ 2 h_{t,\tt p\_identifier}^2  1 > 0 \right\} \\ \end{aligned}\]
The illustration at the beginning of the post shows how this network classifies the example sentence. I won’t do it in this post, but we could set it up in PyTorch, too, and verify that it classifies messages exactly as our handwritten algorithm does.
Training the network to discover better algorithms
If we have a recurrent network, then we should be able to train it. But the network as we have parameterized it so far is not amenable to training. Gradient descent, the algorithm that is commonly used to train neural networks, will be stuck because the binary indicator function has a zero slope almost everywhere.
Trainable activation functions
We overcome this hurdle by switching to an activation function called the rectified linear unit (ReLU), defined as \([x]^+ := x \mathbb{1}\{ x > 0\}\). The numerical constants, called weights and biases, also need to be replaced, as these are the parameters that gradient descent will estimate. We end up with
\[\begin{aligned} h_{t,\tt this\_is\_code}^2 &:= \big[ w_{\tt pp\_identifier}^2 h_{t,\tt pp\_identifier}^1 + \\ &\qquad\quad+ w_{\tt p\_open\_paren}^2 h_{t,\tt p\_open\_paren}^1 + \\ &\qquad\quad+ w_{\tt close\_paren}^2 h_{t,\tt close\_paren}^1 + b_{\tt this\_is\_code}^2 \big]^+ \\ \end{aligned}\]
and so on for the hidden layers. For the output layer, we use a sigmoid activation function:
\[y_T := \frac{1}{ 1 + \exp\left(  w_{\tt seen\_code}^y h_{T,\tt seen\_code}^3  b_{\tt seen\_code}^y \right) }.\]
And we are ready to plug this into PyTorch and train it.
However, what we will find if we do that is that while it is possible to get this network to learn something from the data, its performance is far from exceptional. We can tackle the code detection problem more effectively if we use some of the other options offered by PyTorch.
Architectures with more efficient implementations
One reason for the lackluster performance of our recurrent network is that its architecture is somewhat atypical. As a consequence, more of our training procedure is executed in Python glue code and less of it in PyTorch’s C++ library. An alternative that is provided off the shelf by PyTorch, and that thus has a more efficient implementation, is the Elman RNN. Our architecture differs from Elman’s in that in our network, each hidden layer takes only the previous layer as input: the first layer for token \(t\) takes the third layer for token \(t1\), the second layer takes the first, and the third layer takes the second. In vector notation:
\[\begin{aligned} h_t^1 &:= \left[ w^1 \left( x_t', {h_{t1}^3}' \right)' + b^1 \right]^+ \\ h_t^2 &:= \left[ w^2 h_t^1 + b^2 \right]^+ \\ h_t^3 &:= \left[ w^3 h_t^2 + b^3 \right]^+ \\ \end{aligned}\]
If we followed Elman’s architecture, the layers would be different in two ways. First, each hidden layer for token \(t\) would also take the same layer for token \(t1\) as input. Second, the first hidden layer would not take the final layer for the previous taken as input. This becomes clearer when we compare the formulas shown above with those of the Elman RNN:
\[\begin{aligned} h_t^1 &:= \left[ w^1 \left( x_t', {h_{t1}^1}' \right)' + b^1 \right]^+ \\ h_t^2 &:= \left[ w^2 \left( {h_t^1}', {h_{t1}^2}' \right)' + b^2 \right]^+ \\ h_t^3 &:= \left[ w^3 \left( {h_t^2}', {h_{t1}^3}' \right)' + b^3 \right]^+ \\ \end{aligned}\]
Architectures with more numerically stable gradients
Another reason for the subpar performance of our recurrent network is that in realworld code review, engineers often send longer messages. Longer messages translate into longer token sequences which can throw off the gradient descent algorithm. Gradient descent works in theory but not always in practice: the gradients, obtained by diligently applying the chain rule, can get too close to zero which causes problems with numerical stability.
While Elman’s architecture is susceptible, too, other architectures attempt to mitigate this. So we might well find that a gated recurrent unit or a long short term memory network performs better for our code detection task.
Datadriven discipline
Recurrent neural networks have handled our inevitable spaghetti code better than we could ourselves. But they have done more than that: they have imposed a certain datadriven discipline on us that we would probably not follow if we wrote the algorithm by hand. To train a network,
 we need to select training and validation data sets,
 we need to prelabel them, and
 we need to specify a loss function that makes it explicit what we want the classifier to achieve and what we don’t.
This discipline forces us to clarify our thinking, as we will inevitably find gray areas that we didn’t expect. And it turns out that this discipline is useful even for those problems that we decide to solve by hand.

The first word, “LGTM,” is a commonly used abbreviation of “looks good to me.” Why is it coded as
all_caps_identifier
rather than something else? Because in idiomatic C, constants are given alluppercase names, so without a list of common abbreviations at hand, we err on the side of assuming that “LGTM” could be a constant in some program code.↩︎ 
Marvin Minsky’s 1967 book, Computation: Finite and Infinite Machines, already pointed out that recurrent neural networks can encode state machines. Later, in the late 1980s and through the 1990s, there was a burgeoning literature on how to discover state machines by training appropriately specified recurrent networks (e.g., Giles et al., 1992).↩︎

As a littleknown trivia, this is also occasionally called the Heaviside step function or the Heaviside activation function.↩︎