Monday, December 14, 2015

WTF is CORS?

You've heard of CORS, right? It's that thing that lets you like... make an ajax call to some other site? Or something? Yeah. It can be a bit fuzzy.

tl;dr: If you want a page hosted on `sunnyvale.ca` to call `api.sunnyvale.ca/lots/open`, just make sure the client specifies that it is making a CORS request, and have the backend application return `Access-Control-Allow-Origin: sunnyvale.ca` as part of its headers.

Can't I just make HTTP requests to whatever host I want?

Well, yes. You can in curl, or any other basic HTTP client. But modern browsers have all agreed to implement a rule that prevents scripts hosted on Site A from making ajax calls to Site B. Just like showing lock icons for sites with valid SSL certificates, this so-called "same-origin policy" is something browsers do to make consumers safer.

Why is this needed?

Suppose some asshole buys the domain `paypalloginhelper.com` and registers an SSL certificate for it. They copy all the image and css assets from PayPal, but modify the client-side javascript so that it does the following when the user tries to log in:

1. Send the login credentials to `paypal.com/login/auth`
2. Upon success, send a copy of the login credentials to `paypalloginhelper.com/steal/password`
3. Request and display the content from `paypal.com/user/123/account_info`

This would allow the attacker to gather login credentials from anyone they could trick into going to their site. With additional measures (such as [CSRF Protection](https://www.owasp.org/index.php/Cross-Site_Request_Forgery_%28CSRF%29_Prevention_Cheat_Sheet)), PayPal could detect and reject such requests, but the same-origin policy allows browsers to prevent them altogether.

So what is CORS, again?

Invoking `api.sunnyvale.ca/lots/open` from within the `sunnyvale.ca/index.html` document is an example of Cross-Origin Resource Sharing (in browser lingo, `sunnyvale.ca` and `api.sunnyvale.ca` are two separate *origins*). Since the browser normally stops such behavior, we need to ask for an exception. The browser will try to confirm this exception by double checking with our API server. If the API server says `sunnyvale.ca` is allowed to make CORS requests, the browser will give us a thumbs up and proceed with our request.

Step 1 -- Client tells browser it wants to make a CORS request

A jQuery example due to [Yuriy Dybskiy](https://github.com/html5cat/cors-wtf) shows how to initiate a CORS request from a document on `sunnyvale.ca`:

{% highlight javascript %}
$.ajax({
  type: 'GET',
  url: 'https://api.sunnyvale.ca/lots/open',
  crossDomain: true // <=== this is the important bit
}).done(function (data) {
  // print open lots to console
  console.log(data);
});
{% endhighlight %}

Step 2 -- Browser asks server if sunnyvale.ca is on its VIP list

Before sending our GET request, the browser will first issue what is called a *pre-flight* request in order to check whether our intended request is even allowed. (Strictly speaking, browsers only pre-flight "non-simple" requests; a plain GET with no custom headers is sent directly and the response headers are checked afterward, but the handshake is the same in spirit.) To do this, the browser sends a request with the `OPTIONS` verb, asking the server to confirm that we are allowed to send such a `GET` request from our origin:

{% highlight http %}
OPTIONS /lots/open HTTP/1.1
Host: api.sunnyvale.ca
Origin: https://sunnyvale.ca
Access-Control-Request-Method: GET
{% endhighlight %}

Recall here that CORS and the same-origin policy are browser-level features. It does no good to write server code that discriminates among requests based solely on the Origin header; attackers will lie to you anyhow. We simply need to tell the browser which Origins we expect such calls from. These features are for the user's protection, not yours.

Step 3 -- Server confirms that it expects CORS requests from sunnyvale.ca

To allow CORS requests from documents on `sunnyvale.ca`, include the `Access-Control-Allow-Origin` header in your reply, like so:

{% highlight http %}
HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://sunnyvale.ca
Some-Other-Headers: "Whatever"
{% endhighlight %}

That tells the browser that we know about `sunnyvale.ca`, and that we expect documents hosted there to make CORS requests to us. That will assuage the browser's concerns, and it will then proceed with the original request *like a normal ajax call*.
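If you are wiring this up by hand on the backend, the logic boils down to checking the request's `Origin` against an allowlist and echoing it back. A minimal sketch in Python (the allowlist contents and function name are my own illustrative assumptions, not a real framework API):

```python
# Sketch: build CORS response headers from an Origin allowlist.
ALLOWED_ORIGINS = {"https://sunnyvale.ca"}

def cors_headers(request_origin):
    """Return headers to attach to the response; empty if the
    origin is not one we expect CORS calls from."""
    if request_origin in ALLOWED_ORIGINS:
        # Echo the specific origin rather than "*", so only the
        # documents we expect will pass the browser's check.
        return {"Access-Control-Allow-Origin": request_origin}
    return {}

print(cors_headers("https://sunnyvale.ca"))
# {'Access-Control-Allow-Origin': 'https://sunnyvale.ca'}
```

Remember: this header is advice to the browser, not access control. A curl user sees the response either way.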

Why can't browsers just assume that subdomains are safe?

In most cases, `whatever.com` and `api.whatever.com` will be managed by the same team. So why not just have browsers trust requests to `*.whatever.com`? Think about sites like GitHub that allow users to manage content on their own subdomains. Similar to the hypothetical PayPal example above, a malicious user could register the name `developerlogin` and then control content for the browser origin `developerlogin.github.com`. Explicitly specifying which origins are allowed to make CORS requests prevents this means of leveraging a user's browser in a data-sniffing attack.

References

1. [HTTP Access Control (CORS)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Access_control_CORS) -- [Mozilla Developer Network](https://developer.mozilla.org/en-US/)
2. [CORS W.T.F.?! or 'What is the best way to play with cors locally?'](https://github.com/html5cat/cors-wtf) -- [Yuriy Dybskiy](http://dybskiy.com/)
3. [Cross-Site Request Forgery (CSRF) Prevention Cheat Sheet](https://www.owasp.org/index.php/Cross-Site_Request_Forgery_%28CSRF%29_Prevention_Cheat_Sheet) -- [Open Web Application Security Project](https://www.owasp.org/index.php/Main_Page)

Friday, November 27, 2015

Speed Up That Nokogiri Build

An important part of continuous integration is making sure that you can rebuild your environment from scratch. Unfortunately for most Rails apps, running `bundle install` can cause your CI job to hang for several minutes, and nokogiri is usually the culprit.

Don't rebuild LibXML from scratch

The reason it takes so long to install nokogiri is that, by default, it will try to build its own copies of libxml and libxslt, which are huge. That definitely makes initial adoption easier, but now we're stuck with it, so let's speed things up a bit. Fortunately, the nokogiri gem takes a build option that tells it to look for system-installed xml and xslt libraries. To get Bundler to use it, we create an entry in the local bundler config and then install it like so:

bundle config build.nokogiri --use-system-libraries
bundle install

Clean up after yaself

The trouble with this approach is that the bundle config is not ephemeral, and if you are running multiple builds on the same Jenkins server, the config will persist and affect other branches. I recommend sticking the following in your `test.sh` right after your gems are installed:

bundle config --delete build.nokogiri

Doing this for any gem-specific build settings after you run `bundle install` will make sure that you do not get screwy environment behavior when running future builds.

Sunday, February 22, 2015

Maintaining Public and Private code in the Same Repo

Let's face it, git submodules suck. They are confusing, fragile, and orthogonal to the ecosystem of commits and refs that make up the rest of git's functionality. Fortunately, for some use cases, there are other routes.

Private Apps with Public Libs

Sometimes while working on a private application, you might whip up a compact, unrelated set of modules to solve a small problem, or provide some extra functionality to the rest of your development environment. Often, this stuff might be applicable in a wider context. Occasionally, your employer will even give you the thumbs up to share it with the outside world. Pretty cool, right? For a brief moment, you can get paid to work on your own open source project. Now you're faced with the task of restructuring this code as a separate package and including it in your main codebase as a proper dependency. This means you'll have to maintain it as an entirely separate package, and any new functionality you add needs to be rebuilt and then distributed to your private app just as any third party dependency would. While this is certainly The Right Way to Do It, it can also be a pain in the ass.

Merge-only branches

In some narrow cases, it can make sense to have special "merge-only" branches that don't contain any proprietary code from your private app. Instead, they only contain the code for the open source software you want to share publicly.

git checkout <SHA of first commit>
git checkout -b public
# Hack away on public code
If you're starting this public project in the middle of an existing proprietary project, you'll want to make sure you branch from a commit that has no proprietary code. The first commit to your project might be an appropriate place to start, but it might also have sensitive data (i.e. the first import of your code from a previous version control system). If that is the case, take a look at GitHub's guide to [Removing sensitive data](https://help.github.com/articles/remove-sensitive-data/). Once you've got a clean commit to branch from, you can set up special remotes for your public branch. For example:

git remote add public-origin https://github.com/CoolwareInc/cool-public-lib.git
git push -u public-origin public:master
Now you can work on your publicly available libs, tag them, and then merge them into your proprietary codebase:

vim lib/awesome.rb
git commit -m "Refactor Awesome#helper to be 1,000 times faster"
git tag -a v1.2.3 -m "Performance improvements to libAwesome"
git push --follow-tags # push any tags that reference pushed commits
# Now import back into private app
git checkout master
git merge public

Downsides

This approach isn't perfect. Namely, it can be pretty easy to do public work in a private branch, making it a mess to back those changes out later. Or worse, you could accidentally push your private commits to the public remote. If you are interested in going down this route, take a look at [Git Hooks](http://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks), specifically the `pre-push` hook, which should allow you to double check whether you are trying to push private refs to the wrong remote.
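As a sketch, the check inside such a `pre-push` hook might look like this (the remote and branch names match the examples above; the helper function is my own):

```shell
#!/bin/sh
# check_push REMOTE LOCAL_REF -- allow (0) or refuse (1) a push.
check_push() {
  remote="$1"
  local_ref="$2"
  # Only the public branch may go to the public remote.
  if [ "$remote" = "public-origin" ] && [ "$local_ref" != "refs/heads/public" ]; then
    echo "refusing to push private ref $local_ref to $remote" >&2
    return 1
  fi
  return 0
}

# In the real hook, git passes the remote name as $1 and feeds
# "<local_ref> <local_sha> <remote_ref> <remote_sha>" lines on stdin:
#   while read local_ref local_sha remote_ref remote_sha; do
#     check_push "$1" "$local_ref" || exit 1
#   done
```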

References

[Push git commits & tags simultaneously](http://stackoverflow.com/a/3745250)

Sunday, April 14, 2013

Group Theory for 5th Graders

In 2011 I was very fortunate to be awarded an internship as an elementary school teaching assistant in my hometown. I worked with 3rd, 4th, and 5th graders who were having difficulties with mathematics.

One thing that I noticed while working with these students is that they are still curious enough to be motivated by the fact that something is cool in its own right, without having to scrutinize its utility. In some sense, curiosity is the only genuine prerequisite for mathematics. Unfortunately, we squander this resource by offering a curriculum barren of anything not assumed to be helpful on the end-of-year exams\(^1\).

Given that this constraint is not going away any time soon, what more can we do with the curriculum we already have? Let's consider a few points:
  • Starting in 3rd grade, students begin to build and work with multiplication tables, which, aside from closure, are not unlike Cayley tables.
  • Also in 3rd grade, students learn to tell time by reading a clock. They are asked questions such as "What time is 5 hours before 3:00?" and "If it is 3:15, where will the minute hand be in 1 hour?". These are precisely the same questions that undergraduate math students are asked in an initial survey of \(\mathbb{Z}_{12}\) and \(\mathbb{Z}_{60}\).
  • Fourth graders are asked to divide two natural numbers and consider the remainder.
  • In order to simplify fractions, fifth grade students are asked to find the greatest common factor of two natural numbers. This is a great opportunity to introduce prime factorization.
  • Notions of Reflection, Rotation, and Translation are covered explicitly in 5th grade geometry standards, at least in Tennessee.
So, why not introduce group theory to fifth graders? The only thing they are really lacking is a structured concept of a "Set". Further, given that all of these students are headed towards an algebra curriculum\(^2\), it makes sense to introduce them to concepts like operators, inverses, identities, associativity, and commutativity ahead of time in a structured, possibly even visual way, rather than relying on a purely intuitive approach later on.
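In fact, the clock questions above are modular arithmetic in disguise. "What time is 5 hours before 3:00?" is just subtraction in \(\mathbb{Z}_{12}\):

\[3 - 5 \equiv -2 \equiv 10 \pmod{12}\]

so the answer a 3rd grader already gives, 10:00, is a computation in a finite group.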

For example, consider the task of explaining how fractions are added. This involves rewriting one or both fractions in terms of a common denominator. There are two challenges here: (a) how does one find a common denominator? and (b) why is it legitimate to rewrite a fraction as some other fraction?

The answer to the second question hinges, mathematically, on the concept of a group identity element. In this case, we know that for any fraction \(\frac{a}{b}\), it is the case that \(1 \cdot \frac{a}{b} = \frac{a}{b}\). However, it is also the case that for any nonzero \(n \in \mathbb{Z}\), \(\frac{n}{n} = 1\). Thus, the natural connection to make is that for any nonzero \(n \in \mathbb{Z}\), \(\frac{a}{b} = \frac{na}{nb}\). I have seen many students struggle to connect those two facts, and my guess is that it seems to violate their mathematical intuition. I strongly suspect that discussing the identity concept more explicitly would help clarify this.
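To make that concrete, here is the full chain of identity-element moves behind a problem like \(\frac{1}{3} + \frac{1}{4}\):

\[\frac{1}{3} + \frac{1}{4} = \frac{4}{4} \cdot \frac{1}{3} + \frac{3}{3} \cdot \frac{1}{4} = \frac{4}{12} + \frac{3}{12} = \frac{7}{12}\]

Every rewriting step is multiplication by the identity in disguise.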

Another concept that frequently troubles students at this stage is why statements such as "\(\frac{1}{7} \cdot 7 = 1\)" should be true. One way to introduce this concept is by representing "parts of a whole" visually, as in pizza slices. However, presenting it as a necessary consequence of the fact that group elements have inverses captures the idea without relying on intuition.

\(^1\) See any work by Diane Ravitch for a more thorough discussion of this problem.
\(^2\) Ostensibly. See Tavis Smiley's work on the "School to Prison Pipeline".

Sunday, March 31, 2013

Leveraging Sow to Simplify Loops

One of Mathematica's coolest list manipulation techniques is the Reap/Sow pattern. Using these functions together allows you to build up multiple collections at once in a decoupled manner.

To understand this idea and when it might make a difference, consider how you would sort a list of integers by divisor. What I mean is, given the list \[\{12,13,14,15,21,24,28,41,47,49,55\}\] sort it into the following lists based on divisor: \[\begin{align} 2 &\rightarrow \{12,14,24,28\} \\ 3 &\rightarrow \{12,15,21,24\} \\ 5 &\rightarrow \{15,55\} \end{align}\] (and here it's okay if one item shows up in two lists)

You might think to do it by using a for-loop and storing each item in a named list depending on what it matches. But what if you want a variable list of divisors? In this case, managing the list of results can get a little tricky.

The Reap/Sow pattern presents a different approach: when important values are computed inside a function they can be "sown" with Sow, which means they bubble up the call stack until a matching Reap expression is encountered. You can use a Reap expression to process or discard sown values as you see fit.

An implementation of the divisor sorting algorithm using Reap and Sow might look something like this:

SortByDivisors[numbers_, divisors_] := Reap[
  For[i = 1, i <= Length[numbers], i++,
    For[j = 1, j <= Length[divisors], j++,
      n = numbers[[i]];
      d = divisors[[j]];
      If[Mod[n, d] == 0, Sow[n, d]];
    ];
  ],
  _, Rule][[2]];


And it will return a collection of Rules, each of which has a divisor as the key and a corresponding list of multiples as the value.
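For readers without Mathematica handy, the same pattern can be approximated in Python, with a dictionary of lists standing in for the reaped tags (the function name is mine, not part of the Mathematica code above):

```python
from collections import defaultdict

def sort_by_divisors(numbers, divisors):
    """Group each number under every divisor that divides it,
    mimicking Sow-ing a value under a tag and Reap-ing by tag."""
    reaped = defaultdict(list)
    for n in numbers:
        for d in divisors:
            if n % d == 0:
                reaped[d].append(n)  # the "Sow[n, d]" step
    return dict(reaped)

print(sort_by_divisors([12, 13, 14, 15, 21, 24, 28, 41, 47, 49, 55], [2, 3, 5]))
# {2: [12, 14, 24, 28], 3: [12, 15, 21, 24], 5: [15, 55]}
```

The dictionary plays the role of the Reap expression, though it is less decoupled: in Mathematica the Sow call can live anywhere down the call stack.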

Sunday, March 17, 2013

Employing "Map" to Make Mathematica More Elegant

One of Mathematica's most under-used features is its support for functional programming.  Many researchers treat Mathematica as a procedural language; that is, as though it were C or FORTRAN with interactive plotting. This certainly works but, with apologies to Dijkstra, it can lead to code that is neither simple nor clear -- in short: not what mathematicians would call "elegant".

One quick way to make your Mathematica code more elegant is to use the Map command in place of For loops when building up arrays. Map is not an outright replacement for For loops, but it can be very helpful when you are trying to transform one data set into another.

For a simple example, let's say that you need to compute the sine of a set of angles. If you were doing this in the style of a procedural programming language, your code might look like the following:
input = {0, Pi/3, 2 Pi/3, Pi};

output = ConstantArray[0, Length[input]];
For[i = 1, i <= Length[input], i++,
   output[[i]] = Sin[input[[i]]];
];
Now, that really doesn't look so bad. It's only three lines of code. However, it expresses only one idea: the transformation of a single data set. It would be more elegant if we could express this one idea in a single -- readable -- line of code. Fortunately, this is exactly the task for which Map was intended:
output = Map[Sin,input];
This statement applies the Sin command to each element of the input list and captures the results in an output list in corresponding order. It is equivalent to the previous example in terms of behavior and speed, yet it is superior in terms of clarity because it expresses the idea more compactly.

Of course, for a single loop, the payoff in clarity will be minimal. On the other hand, if you make it a habit to express transformations in this fashion, the benefits you reap from code simplicity will grow along with your project.

Sunday, March 3, 2013

Using Linear Algebra to Teach Linear Algebra

Linear Algebra is supposed to be the study of linear transformations between vector spaces. However, it can be hard to tell that from the way Linear Algebra classes usually start -- i.e. a disconnected, unmotivated survey of row manipulation operations.

To be fair, this discussion isn't entirely unmotivated. It's usually presented in the context of Gaussian elimination for the purpose of solving a system of equations. While that's certainly not inaccurate, presenting the material only from that perspective unnecessarily narrows its scope in the mind of the student, making it harder to generalize later. The problem is three-fold:
  • row manipulation is presented as something that is specifically "for" equation solving
  • the row manipulation operations are presented as external algorithms
  • the matrix concept is treated as a passive thing (a data structure), rather than an active thing (a transformation).
Why do this? Why introduce extra algorithms to fiddle with values in a 2D array? Linear Algebra already provides an algorithm powerful enough to do all this stuff and more: matrix multiplication.

For example, let's start with the following matrix:

\[\left(\begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & i \end{array}\right)\]
Now suppose we want to interchange Row 1 with Row 2. We can do this by multiplying on the left using a special matrix designed for interchanging those rows:

\[
\left(\begin{array}{ccc} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{array}\right) *
\left(\begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & i \end{array}\right) =
\left(\begin{array}{ccc} d & e & f \\ a & b & c \\ g & h & i \end{array}\right)
\]
Another common row manipulation operation is to add a scalar multiple of one row to another. Let's say we want to triple Row 1 and add those values to Row 3. Again, we can achieve this via left multiplication with a special matrix designed for that purpose:

\[
\left(\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 3 & 0 & 1 \end{array}\right) *
\left(\begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & i \end{array}\right) =
\left(\begin{array}{ccc} a & b & c \\ d & e & f \\ 3a + g & 3b + h & 3c + i \end{array}\right)
\]
Two questions arise here: Primarily, how are these special matrices constructed? Also, what is the advantage to even doing any of this?

Constructing these matrices becomes obvious once we invoke one of the fundamental principles of Linear Algebra: the matrix representation of any linear transformation comes from applying that transformation to the identity matrix.

So, if you'll notice, our matrix for swapping rows 1 and 2 was constructed by simply swapping rows 1 and 2 of the identity matrix. Likewise, our matrix for adding the triple of row 1 to row 3 was constructed by tripling row 1 of the identity matrix and adding it to row 3 of the identity matrix.

That also partially answers the question "What is the advantage?". As a pedagogical tool, this would provide an early opportunity to teach the core notions of Linear Algebra without bogging the student down in what is frequently perceived as accounting homework.

However, there is a further advantage in that tedious row manipulation algorithms can be represented compactly as products of their corresponding matrices. Not only does this allow for an early discussion of the composition of linear transformations, but taking a giant list of row operations and expressing it compactly as a single matrix is an excellent way to demonstrate that Linear Algebra is Powerful.
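The two examples above are easy to check numerically. Here is a sketch in Python with plain lists (any matrix library would do just as well; the helper function is mine):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Elementary matrix for swapping rows 1 and 2: swap those rows
# of the identity matrix.
swap_12 = [[0, 1, 0],
           [1, 0, 0],
           [0, 0, 1]]

# Elementary matrix for adding 3 times row 1 to row 3: apply that
# operation to the identity matrix.
add_3r1_to_r3 = [[1, 0, 0],
                 [0, 1, 0],
                 [3, 0, 1]]

M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

print(matmul(swap_12, M))
# [[4, 5, 6], [1, 2, 3], [7, 8, 9]]
print(matmul(add_3r1_to_r3, M))
# [[1, 2, 3], [4, 5, 6], [10, 14, 18]]
```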