Surprise : I am mixing your data
When it comes to doing SQL queries onto a relational database there is a time when this query needs to use rows from two tables to give you the rows you want. These are called “join” queries. Their purpose is to combines records from two tables, let's see how that can end up.
When it comes to doing SQL queries onto a relational database there is a time when this query needs to use rows from two tables to give you the rows you want.
These are called "join" queries. Their purpose is to combines records from two tables. The usual diagrams shows two circles intersecting : an inner join query would return you what’s common to both circles (i.e. tables), the left (outer) join would give you what’s only in the left circle, and a right (outer) join would give you what’s only in the right circle.
The trick myself and some team members walked into this week is that those queries are meant to return you a mix of the rows of the two tables. Row by row, the ones matching are mixed together and returned.
Let's see how this happen, and how to avoid it.
Note :
As this post is fairly code oriented I decided to not record an audio version of it, apologies.
Preparing the trap
Let's make a minimal setup here : two ruby classes, the Sequel gem, a MySQL host.
require 'sequel'
MYSQL_DB = Sequel.connect(adapter: ‘mysql2’, host: ‘127.0.0.1’, user: ‘root’, password: ‘password’, port: 3306, database: ‘join_test’)
class MySQLBook < Sequel::Model(MYSQL_DB[:books]); end
class MySQLChapter < Sequel::Model(MYSQL_DB[:chapters]); end
class DbSetup
attr_reader :connector
def initialize(connector)
@connector = connector
setup_db
insert_data
end
private
def setup_db
if connector.table_exists?(:chapters)
connector[:chapters].truncate
else
connector.create_table :chapters do
primary_key :id
String :name
String :text
String :published
Integer :book_id
end
end
if connector.table_exists?(:books)
connector[:books].truncate
else
connector.create_table :books do
primary_key :id
String :name
String :intro
String :published
end
end
end
def insert_data
books = connector[:books]
books.insert(name: ‘Babc’, intro: “Lorem Ipsum is simply dummy text of the printing and typesetting industry.”, published: “yes”)
books.insert(name: 'Bklm', intro: "Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).”, published: “no”)
published_book_1 = books.where(name: ‘Babc’).first
unpublished_book_1 = books.where(name: ‘Bklm’).first
chapters = connector[:chapters]
count = 1
chapters.insert(name: “Chapter #{count}”, book_id: published_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
count +=1
chapters.insert(name: “Chapter #{count}”, book_id: published_book_1[:id], published: ‘no’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
count +=1
chapters.insert(name: “Chapter #{count}”, book_id: published_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
count +=1
chapters.insert(name: "Chapter #{count}", book_id: published_book_1[:id], published: 'no' , text: 'All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ')
count = 1
chapters.insert(name: "Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
count +=1
chapters.insert(name: “Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘no’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
count +=1
chapters.insert(name: “Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
count +=1
chapters.insert(name: “Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘no’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
end
end
DbSetup.new(MYSQL_DB)
We can save this in a file named db_setup.rb
.
This code is fairly simple : we start by initialising a connector to MySQL through Sequel, then we define two classes inheriting from the Sequel::Model
class, then we setup a pair of tables. The two classes are named MySQLChapter
and MySQLBook
as this article is based on the relation between chapters and books to show the specific case of join queries in Sequel.
This expects that a database was created within the MySQL host, and also notice the username and password here, you will need to adjust to your local setup.
The setup of the table creates the tables themselves with a few columns including a column name
in both and a column book_id
in the chapters
table. Then we insert a few rows in both tables to have enough data to make sense.
To use this we need to install Sequel, so let’s prepare a simple Gemfile
for this.
source 'https://rubygems.org'
ruby '2.6.5’
gem ‘sequel’
gem ‘mysql2’
gem ‘postgres’ # that's for later
gem ‘byebug’ # always handy
Once saved you can run bundle install
and then bundle exec ruby setup.rb
. This will connect to MySQL, create the tables and insert data. If the tables already exists they will get truncated (all rows removed) and rows will be reinserted.
A simple query first
Sequel is similar to ActiveRecord or other ORMs out there. But, just like DataMapper, it’s still different to it. Let’s see a couple of base queries.
First let’s get our favorite ruby inline console running. You can go with irb
or pry
for example.
In there let's require Sequel
, initialize a simple connector to the MySQL DB and two classes. Basically the first part of the previous code listing.
require 'sequel'
MYSQL_DB = Sequel.connect(adapter: ‘mysql2’, host: ‘127.0.0.1’, user: ‘root’, password: ‘password’, port: 3306, database: ‘join_test’)
class MySQLBook < Sequel::Model(MYSQL_DB[:books]); end
class MySQLChapter < Sequel::Model(MYSQL_DB[:chapters]); end
By using a model name you can directly run queries.
Counting all MySQLBooks in the table with the count
class method :
[9] pry(main)> MySQLBook.count
=> 2
Getting the first MySQLBook with the first
class method, and all of them with the all
class method :
[10] pry(main)> MySQLBook.first
=> #<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>
[11] pry(main)> MySQLBook.all
=> [#<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>,
#<MySQLBook @values={:id=>2, :name=>"Bklm", :intro=>"Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).", :published=>"no"}>]
And, here we can see the first different with ActiveRecord, getting a specific book by passing its id between square brackets :
[12] pry(main)> MySQLBook[1]
=> #<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>
The class method find
is also available.
Getting a specific book with a specific title can be done with where
:
[14] pry(main)> MySQLBook.where(name: "Babc")
=> #<Sequel::Mysql2::Dataset: "SELECT * FROM `books` WHERE (`name` = 'Babc')">
But this is not the row we are looking for, this is merely a ruby expression of the SQL query that will be run.
To get the query itself we can use the sql
method added to this :
[15] pry(main)> MySQLBook.where(name: "Babc").sql
=> "SELECT * FROM `books` WHERE (`name` = 'Babc')"
And to get the rows themselves we need to use the all
method (it will return an array), or the first
method if we just want to get the first entry :
[16] pry(main)> MySQLBook.where(name: "Babc").all
=> [#<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>]
[17] pry(main)> MySQLBook.where(name: "Babc").first
=> #<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>
As pointed out in the intro of this part : this is very similar to ActiveRecord.
Now, let's see how we can something more complex to work.
Join
SQL is a very powerful query language. It’s evolved for 50 years and is full of features. Some database products out there don’t entirely follow its standards but most do.
When one wants to filter rows of one table based on the values of the rows of another table they are related to one usually turns to either do several queries in a script or use a SQL JOIN.
Let's first see what multiple queries would look like. Using our chapters and books let’s list only chapters that are published and part of published books.
published_chapters = MySQLChapter.where(published: "yes").all
filtered_chapters = published_chapters.select do |chapter|
book = MySQLBook.find(chapter.book_id)
book.published == "yes"
end
The first line will get all published chapters, then we proceed by filtering out the ones that are not related to a published book.
Array#select
This method allows to go through each item within an array and return an array with items for which the block passed toselect
has returnedtrue
. There is a “mirror” to this method calledreject
.
This definitely works, there is one issue though : it leaves all this computation to be done on the ruby side. If there is only a few chapters and books to go through it’s ok, but if there is a lot of entries it will get costly.
The solution is to do this within one SQL query by using a JOIN.
SQL JOIN
Join queries are meant to return data from two tables that match a certain number of conditions. There are different types of join queries depending on what we want to match : inner join, left outer, right outer, full outer and cross.
The usual explanation involves two over lapping circles representing the two tables. The surface overlapped by both circles is used to represent the “inner join” : the data that match the query in both tables. As this is the type of Join we want here we are not going to cover more than that. See the links at the bottom of the article to get more details on this.
Expressed simply we want the published chapters related to published books. Sequel supports joins directly on models. Let’s try this.
[27] pry(main)> MySQLChapter.where(published: "yes").join(MySQLBook.where(published: "yes"), id: :book_id).all
Sequel::DatabaseError: Mysql2::Error: Column 'published' in where clause is ambiguous
...
Caused by Mysql2::Error: Column 'published' in where clause is ambiguous
...
The query is built as an extension of what we saw previously : MySQLChapter.where(published: "yes”)
. We add the join clause join(MySQLBook, id: :book_id)
by passing the class we want to check against and a pair of keys to tell SQL what column to use to related the rows. Here there id
column of the MySQLBook model should be matched with the book_id
of the MySQLChapter one. Since we only wanted published books we also add a condition on the published
column : MySQLBook.where(published: "yes"), id: :book_id
.
This thows an exception as both tables have published
column, SQL doesn’t know if it needs to use only one of the published
columns, or both and in which order.
By the way this generates the following SQL :
[28] pry(main)> MySQLChapter.where(published: "yes").join(MySQLBook.where(published: "yes"), id: :book_id).sql
=> "SELECT * FROM `chapters` INNER JOIN (SELECT * FROM `books` WHERE (`published` = 'yes')) AS `t1` ON (`t1`.`id` = `chapters`.`book_id`) WHERE (`published` = 'yes')"
The solution is to attach the name of the related table to each reference to the column.
MySQLChapter.from(Sequel[:chapters].as(:chapter)).
where(Sequel[:chapter][:published] => 'yes').
join(MySQLBook.from(Sequel[:books].as(:book)).
where(Sequel[:book][:published] => 'yes'), id: :book_id).all
This is starting to get heavy (but we are not done yet). The trick is to use the Sequel[<table_name>].as(<custom_name>)
and then Sequel[<custom_name>][<column_name>]
syntaxes to name the tables and then attach them to the column names.
In practice this is what it does :
[29] pry(main)> MySQLChapter.from(Sequel[:chapters].as(:chapter)).where(Sequel[:chapter][:published] => 'yes').join(MySQLBook.from(Sequel[:books].as(:book)).where(Sequel[:book][:published] => 'yes'), id: :book_id).all
=> [#<MySQLChapter @values={:id=>1, :name=>"Babc", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1, :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry."}>,
#<MySQLChapter @values={:id=>1, :name=>"Babc", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1, :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry."}>]
We get rows back and plenty of data. Yet … if one looks closely :
- the class of each value is
MySQLChapter
; we want to handle chapters so, that’s what we expect. - the value of
name
for each result is the value of thename
column for the book related to that chapter. This is not what we want. We can also see that theintro
of the book is included in the result too.
This is due to the use of join
. The inner join will give back a mix of data of the rows from both tables when they match together.
The fix
This is not an issue, it's actually a feature. But since we are handling models, it’s not very convenient to be getting attributes from the MySQLBook in the middle of instances of MySQLChapter.
The SQL query is a bit more obvious as to why this happens :
[30] pry(main)> MySQLChapter.from(Sequel[:chapters].as(:chapter)).where(Sequel[:chapter][:published] => 'yes').join(MySQLBook.from(Sequel[:books].as(:book)).where(Sequel[:book][:published] => 'yes'), id: :book_id).sql
=> "SELECT * FROM `chapters` AS `chapter` INNER JOIN (SELECT * FROM `books` AS `book` WHERE (`book`.`published` = 'yes')) AS `t1` ON (`t1`.`id` = `chapter`.`book_id`) WHERE (`chapter`.`published` = 'yes')"
The important point is the use of SELECT * FROM …
. Strictly speaking this tells SQL to give you back all the columns that match the rest of the query.
To avoid this, one needs to tell SQL to only returns columns that are related to the chapters table : SELECT chapters.* FROM …
. Sequel has it’s own way to let you do this : MySQLChapter.(Sequel[:chapters].as(:chapter)).select(Sequel.lit(‘chapter.*’)).
.
The full query is the following :
MySQLChapter.from(Sequel[:chapters].as(:chapter)).select(Sequel.lit('chapter.*')).
where(Sequel[:chapter][:published] => 'yes').
join(MySQLBook.from(Sequel[:books].as(:book)).
where(Sequel[:book][:published] => 'yes'), id: :book_id)
And this returns the right results :
[32] pry(main)> MySQLChapter.from(Sequel[:chapters].as(:chapter)).select(Sequel.lit('chapter.*')).where(Sequel[:chapter][:published] => 'yes').join(MySQLBook.from(Sequel[:books].as(:book)).where(Sequel[:book][:published] => 'yes'), id: :book_id).all
=> [#<MySQLChapter @values={:id=>1, :name=>"Chapter 1", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1}>,
#<MySQLChapter @values={:id=>3, :name=>"Chapter 3", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1}>]
Looping back
The use of an ORM is often decried by some but one can’t overlook the fact that they have given many people a way to quickly build products that work. When and only if performance starts to matter we can look at refining the queries that are run.
Sequel is an interesting library to do small and big relational data models for a Ruby project. When needed you can absolutely do complex queries such as the ones above with it.
Yet, Sequel doesn't help much if you are not familiar with the SQL features you are trying to use. But that’s not a sin, thankfully you can go on the web and do plenty of reading on both SQL and Sequel to figure out a solution.
There is also a large community of people out there that will be happy to give you tips and tricks on those, or colleagues that are possibly more experienced and that can give you a hand or pointers to more documentation.
ORMs have hidden many complexities of SQL to myself for years and I am quite happy to stumble upon this kind of problems from time to time to refresh my memory. This one is kind of a text book example of SQL JOIN but the use of it with Sequel makes it a bit odd and leaves me with the impression that I might have been better off writing a plain SQL query install of something that has so much sugar that it’s barely readable compared to the vanilla SQL.
So, SQL or Sequel, that's up to you. Have fun.
Just one more thing ...
Users of ActiveRecord might wonder how a similar case would be handled with AR. Well, there is good news on that side : it doesn’t produce chimeras.
As with other things ActiveRecord makes some choices for you and in this case it decides that, if you request a join query you want objects of the right hand side back.
# with ActiveRecord
Chapter.joins(:books)
# SQL :
# SELECT chapters.* FROM chapters INNER JOIN books ON books.id = chapters.book_id