A quick dive into a SQL query and the nuances of filters and limits.
Recently I worked with @peckjon on a Transposit connector for Algorithmia and an app that uses it. We ended up with a pretty interesting demonstration not just of the Algorithmia connector, but also the power of SQL for APIs.
We wanted to create a SQL query that gets commit messages from GitHub for a given repository and sends them to a sentiment analysis algorithm in Algorithmia. This is easy to do with Transposit and the final query looks like this:
SELECT * FROM (SELECT C.commit.message, S.result.sentiment FROM github.list_commits AS C JOIN algorithmia.call_algorithm AS S ON S.$body.document = C.commit.message WHERE C.owner='seattlerb' AND C.repo='heckle' AND S.user='nlp' AND S.algorithm='SentimentAnalysis' LIMIT 100) WHERE sentiment > 0.2 OR sentiment < -0.2 ORDER BY sentiment
This query has few interesting nuances, but first let's start with the basics.
To get the commit messages from GitHub for a given repository we can use this query:
SELECT * FROM github.list_commits WHERE owner='seattlerb' AND repo='heckle'
To send text to Algorithmia for sentiment analysis we can use this query:
SELECT * FROM algorithmia.call_algorithm WHERE user='nlp' AND algorithm='SentimentAnalysis' AND $body.document = 'text to send to the sentiment analysis algorithm'
We can use
JOIN to combine these queries together:
SELECT C.commit.message, S.result.sentiment FROM github.list_commits AS C JOIN algorithmia.call_algorithm AS S ON S.$body.document = C.commit.message WHERE C.owner='seattlerb' AND C.repo='heckle' AND S.user='nlp' AND S.algorithm='SentimentAnalysis'
As expected, this query will get the list of commits from GitHub and for each one will call the sentiment analysis algorithm with the commit message.
When you run a similar query in a relational database the DB engine will try to match all the rows in both tables based on the condition in the ON clause.
Transposit will basically do the same with the underlying APIs, except for two differences:
S.$body.documentis an input parameter for the Algorithmia API, so instead of trying to "match" the values Transposit will send the commit message as a parameter to the Algorithmia API.
Another interesting part in this query is in the last
WHERE sentiment > 0.2 OR sentiment < -0.2
as you can see this
WHERE is in the outer
SELECT, and the
JOIN is in a subquery.
We can reduce this type of query to the following template:
SELECT * FROM (SELECT * FROM source WHERE [subquery filters] LIMIT 100) WHERE [outer query filters]
At first look this query looks the same as:
SELECT * FROM source WHERE ([subquery filters]) AND ([outer query filters]) LIMIT 100
But these queries are quite different: the first query tells the SQL engine to do the
subquery filter and the
limit 100 before the
outer query filters and the second query applies the limit after both queries.
These two queries mean different things and can yield quite different results. The power of SQL is that you can do a lot by saying a little. As with any high level language, it’s important to mean what you say. Do you want to find “interesting” messages among the first 100 commits or do you want to find the first 100 polarizing commit messages? The former produces a well—understood amount of work; on a bland repository, the latter could churn until you hit a rate limit. Just like SQL for traditional databases, with SQL for APIs it’s important to say what you mean (and go crazy in those commit messages).