Surprise : I am mixing your data

When it comes to doing SQL queries onto a relational database there is a time when this query needs to use rows from two tables to give you the rows you want.

These are called "join" queries. Their purpose is to combines records from two tables. The usual diagrams shows two circles intersecting : an inner join query would return you what’s common to both circles (i.e. tables), the left (outer) join would give you what’s only in the left circle, and a right (outer) join would give you what’s only in the right circle.

The trick myself and some team members walked into this week is that those queries are meant to return you a mix of the rows of the two tables. Row by row, the ones matching are mixed together and returned.

Let's see how this happen, and how to avoid it.

Note :
As this post is fairly code oriented I decided to not record an audio version of it, apologies.

Preparing the trap

Let's make a minimal setup here : two ruby classes, the Sequel gem, a MySQL host.

require 'sequel'

MYSQL_DB = Sequel.connect(adapter: ‘mysql2’, host: ‘127.0.0.1’,  user: ‘root’, password: ‘password’, port: 3306, database: ‘join_test’)

class MySQLBook < Sequel::Model(MYSQL_DB[:books]); end
class MySQLChapter < Sequel::Model(MYSQL_DB[:chapters]); end

class DbSetup
  attr_reader :connector

  def initialize(connector)
    @connector = connector
    setup_db
    insert_data
  end

  private

  def setup_db
    if connector.table_exists?(:chapters)
      connector[:chapters].truncate
    else
      connector.create_table :chapters do
        primary_key :id
        String :name
        String :text
        String :published
        Integer :book_id
      end
    end
    
    if connector.table_exists?(:books)
      connector[:books].truncate
    else
      connector.create_table :books do
        primary_key :id
        String :name
        String :intro
        String :published
      end
    end
  end
  
  def insert_data
    books = connector[:books]
    books.insert(name: ‘Babc’, intro: “Lorem Ipsum is simply dummy text of the printing and typesetting industry.”, published: “yes”)
    books.insert(name: 'Bklm', intro: "Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).”, published: “no”)
  
    published_book_1 = books.where(name: ‘Babc’).first
    unpublished_book_1 = books.where(name: ‘Bklm’).first

    chapters = connector[:chapters]
    count = 1
    chapters.insert(name: “Chapter #{count}”, book_id: published_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
    count +=1
    chapters.insert(name: “Chapter #{count}”, book_id: published_book_1[:id], published: ‘no’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
    count +=1
    chapters.insert(name: “Chapter #{count}”, book_id: published_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
    count +=1
    chapters.insert(name: "Chapter #{count}", book_id: published_book_1[:id], published: 'no' , text: 'All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ')
    
    count = 1
    chapters.insert(name: "Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
    count +=1
    chapters.insert(name: “Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘no’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
    count +=1
    chapters.insert(name: “Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘yes’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
    count +=1
    chapters.insert(name: “Chapter #{count}”, book_id: unpublished_book_1[:id], published: ‘no’ , text: ‘All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ‘)
  end
end

DbSetup.new(MYSQL_DB)

We can save this in a file named db_setup.rb .

This code is fairly simple : we start by initialising a connector to MySQL through Sequel, then we define two classes inheriting from the Sequel::Model class, then we setup a pair of tables. The two classes are named MySQLChapter and MySQLBook as this article is based on the relation between chapters and books to show the specific case of join queries in Sequel.
This expects that a database was created within the MySQL host, and also notice the username and password here, you will need to adjust to your local setup.

The setup of the table creates the tables themselves with a few columns including a column name in both and a column book_id in the chapters table. Then we insert a few rows in both tables to have enough data to make sense.

To use this we need to install Sequel, so let’s prepare a simple Gemfile for this.

source 'https://rubygems.org'

ruby '2.6.5’

gem ‘sequel’
gem ‘mysql2’
gem ‘postgres’	# that's for later
gem ‘byebug’		# always handy

Once saved you can run bundle install and then bundle exec ruby setup.rb. This will connect to MySQL, create the tables and insert data. If the tables already exists they will get truncated (all rows removed) and rows will be reinserted.

A simple query first

Sequel is similar to ActiveRecord or other ORMs out there. But, just like DataMapper, it’s still different to it. Let’s see a couple of base queries.

First let’s get our favorite ruby inline console running. You can go with irb or pry for example.

In there let's require Sequel , initialize a simple connector to the MySQL DB and two classes. Basically the first part of the previous code listing.

require 'sequel'

MYSQL_DB = Sequel.connect(adapter: ‘mysql2’, host: ‘127.0.0.1’,  user: ‘root’, password: ‘password’, port: 3306, database: ‘join_test’)

class MySQLBook < Sequel::Model(MYSQL_DB[:books]); end
class MySQLChapter < Sequel::Model(MYSQL_DB[:chapters]); end

By using a model name you can directly run queries.

Counting all MySQLBooks in the table with the count class method :

[9] pry(main)> MySQLBook.count
=> 2

Getting the first MySQLBook with the first class method, and all of them with the all class method :

[10] pry(main)> MySQLBook.first
=> #<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>
[11] pry(main)> MySQLBook.all
=> [#<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>,
 #<MySQLBook @values={:id=>2, :name=>"Bklm", :intro=>"Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).", :published=>"no"}>]

And, here we can see the first different with ActiveRecord, getting a specific book by passing its id between square brackets :

[12] pry(main)> MySQLBook[1]
=> #<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>

The class method find is also available.

Getting a specific book with a specific title can be done with where :

[14] pry(main)> MySQLBook.where(name: "Babc")
=> #<Sequel::Mysql2::Dataset: "SELECT * FROM `books` WHERE (`name` = 'Babc')">

But this is not the row we are looking for, this is merely a ruby expression of the SQL query that will be run.

To get the query itself we can use the sql method added to this :

[15] pry(main)> MySQLBook.where(name: "Babc").sql
=> "SELECT * FROM `books` WHERE (`name` = 'Babc')"

And to get the rows themselves we need to use the all method (it will return an array), or the first method if we just want to get the first entry :

[16] pry(main)> MySQLBook.where(name: "Babc").all
=> [#<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>]

[17] pry(main)> MySQLBook.where(name: "Babc").first
=> #<MySQLBook @values={:id=>1, :name=>"Babc", :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry.", :published=>"yes"}>

As pointed out in the intro of this part : this is very similar to ActiveRecord.

Now, let's see how we can something more complex to work.

Join

SQL is a very powerful query language. It’s evolved for 50 years and is full of features. Some database products out there don’t entirely follow its standards but most do.

When one wants to filter rows of one table based on the values of the rows of another table they are related to one usually turns to either do several queries in a script or use a SQL JOIN.

Let's first see what multiple queries would look like. Using our chapters and books let’s list only chapters that are published and part of published books.

published_chapters = MySQLChapter.where(published: "yes").all

filtered_chapters = published_chapters.select do |chapter|
  book = MySQLBook.find(chapter.book_id)
  book.published == "yes"
end

The first line will get all published chapters, then we proceed by filtering out the ones that are not related to a published book.

Array#select
This method allows to go through each item within an array and return an array with items for which the block passed to select has returned true. There is a “mirror” to this method called reject.

This definitely works, there is one issue though : it leaves all this computation to be done on the ruby side. If there is only a few chapters and books to go through it’s ok, but if there is a lot of entries it will get costly.

The solution is to do this within one SQL query by using a JOIN.

SQL JOIN
Join queries are meant to return data from two tables that match a certain number of conditions. There are different types of join queries depending on what we want to match : inner join, left outer, right outer, full outer and cross.
The usual explanation involves two over lapping circles representing the two tables. The surface overlapped by both circles is used to represent the “inner join” : the data that match the query in both tables. As this is the type of Join we want here we are not going to cover more than that. See the links at the bottom of the article to get more details on this.

Expressed simply we want the published chapters related to published books. Sequel supports joins directly on models. Let’s try this.

[27] pry(main)> MySQLChapter.where(published: "yes").join(MySQLBook.where(published: "yes"), id: :book_id).all

Sequel::DatabaseError: Mysql2::Error: Column 'published' in where clause is ambiguous
...
Caused by Mysql2::Error: Column 'published' in where clause is ambiguous
...

The query is built as an extension of what we saw previously : MySQLChapter.where(published: "yes”). We add the join clause join(MySQLBook, id: :book_id) by passing the class we want to check against and a pair of keys to tell SQL what column to use to related the rows. Here there id column of the MySQLBook model should be matched with the book_id of the MySQLChapter one. Since we only wanted published books we also add a condition on the published column : MySQLBook.where(published: "yes"), id: :book_id.

This thows an exception as both tables have published column, SQL doesn’t know if it needs to use only one of the published columns, or both and in which order.

By the way this generates the following SQL :

[28] pry(main)> MySQLChapter.where(published: "yes").join(MySQLBook.where(published: "yes"), id: :book_id).sql
=> "SELECT * FROM `chapters` INNER JOIN (SELECT * FROM `books` WHERE (`published` = 'yes')) AS `t1` ON (`t1`.`id` = `chapters`.`book_id`) WHERE (`published` = 'yes')"

The solution is to attach the name of the related table to each reference to the column.

MySQLChapter.from(Sequel[:chapters].as(:chapter)).
	where(Sequel[:chapter][:published] => 'yes').
join(MySQLBook.from(Sequel[:books].as(:book)).
	where(Sequel[:book][:published] => 'yes'), id: :book_id).all

This is starting to get heavy (but we are not done yet). The trick is to use the Sequel[<table_name>].as(<custom_name>) and then Sequel[<custom_name>][<column_name>] syntaxes to name the tables and then attach them to the column names.

In practice this is what it does :

[29] pry(main)> MySQLChapter.from(Sequel[:chapters].as(:chapter)).where(Sequel[:chapter][:published] => 'yes').join(MySQLBook.from(Sequel[:books].as(:book)).where(Sequel[:book][:published] => 'yes'), id: :book_id).all

=> [#<MySQLChapter @values={:id=>1, :name=>"Babc", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1, :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry."}>,
 #<MySQLChapter @values={:id=>1, :name=>"Babc", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1, :intro=>"Lorem Ipsum is simply dummy text of the printing and typesetting industry."}>]

We get rows back and plenty of data. Yet … if one looks closely :

  • the class of each value is MySQLChapter; we want to handle chapters so, that’s what we expect.
  • the value of name for each result is the value of the name column for the book related to that chapter. This is not what we want. We can also see that the intro of the book is included in the result too.

This is due to the use of join. The inner join will give back a mix of data of the rows from both tables when they match together.

The fix

This is not an issue, it's actually a feature. But since we are handling models, it’s not very convenient to be getting attributes from the MySQLBook in the middle of instances of MySQLChapter.

The SQL query is a bit more obvious as to why this happens :

[30] pry(main)> MySQLChapter.from(Sequel[:chapters].as(:chapter)).where(Sequel[:chapter][:published] => 'yes').join(MySQLBook.from(Sequel[:books].as(:book)).where(Sequel[:book][:published] => 'yes'), id: :book_id).sql

=> "SELECT * FROM `chapters` AS `chapter` INNER JOIN (SELECT * FROM `books` AS `book` WHERE (`book`.`published` = 'yes')) AS `t1` ON (`t1`.`id` = `chapter`.`book_id`) WHERE (`chapter`.`published` = 'yes')"

The important point is the use of SELECT * FROM …. Strictly speaking this tells SQL to give you back all the columns that match the rest of the query.

To avoid this, one needs to tell SQL to only returns columns that are related to the chapters table : SELECT chapters.* FROM …. Sequel has it’s own way to let you do this : MySQLChapter.(Sequel[:chapters].as(:chapter)).select(Sequel.lit(‘chapter.*’))..

The full query is the following :

MySQLChapter.from(Sequel[:chapters].as(:chapter)).select(Sequel.lit('chapter.*')).
          where(Sequel[:chapter][:published] => 'yes').
          join(MySQLBook.from(Sequel[:books].as(:book)).
          where(Sequel[:book][:published] => 'yes'), id: :book_id)

And this returns the right results :

[32] pry(main)> MySQLChapter.from(Sequel[:chapters].as(:chapter)).select(Sequel.lit('chapter.*')).where(Sequel[:chapter][:published] => 'yes').join(MySQLBook.from(Sequel[:books].as(:book)).where(Sequel[:book][:published] => 'yes'), id: :book_id).all

=> [#<MySQLChapter @values={:id=>1, :name=>"Chapter 1", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1}>,
 #<MySQLChapter @values={:id=>3, :name=>"Chapter 3", :text=>"All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. ", :published=>"yes", :book_id=>1}>]

Looping back

The use of an ORM is often decried by some but one can’t overlook the fact that they have given many people a way to quickly build products that work. When and only if performance starts to matter we can look at refining the queries that are run.

Sequel is an interesting library to do small and big relational data models for a Ruby project. When needed you can absolutely do complex queries such as the ones above with it.
Yet, Sequel doesn't help much if you are not familiar with the SQL features you are trying to use. But that’s not a sin, thankfully you can go on the web and do plenty of reading on both SQL and Sequel to figure out a solution.
There is also a large community of people out there that will be happy to give you tips and tricks on those, or colleagues that are possibly more experienced and that can give you a hand or pointers to more documentation.

ORMs have hidden many complexities of SQL to myself for years and I am quite happy to stumble upon this kind of problems from time to time to refresh my memory. This one is kind of a text book example of SQL JOIN but the use of it with Sequel makes it a bit odd and leaves me with the impression that I might have been better off writing a plain SQL query install of something that has so much sugar that it’s barely readable compared to the vanilla SQL.

So, SQL or Sequel, that's up to you. Have fun.

Just one more thing ...

Users of ActiveRecord might wonder how a similar case would be handled with AR. Well, there is good news on that side : it doesn’t produce chimeras.
As with other things ActiveRecord makes some choices for you and in this case it decides that, if you request a join query you want objects of the right hand side back.

# with ActiveRecord
Chapter.joins(:books)
# SQL :
# SELECT chapters.* FROM chapters INNER JOIN books ON books.id = chapters.book_id

Need help ?

We specialise in helping small and medium teams transform the way they build, manage and maintain their Internet based services.

With more than 10 years of experience in running Ruby based web and network applications, and 6 years running products servicing from 2000 to 100000 users daily we bring skills and insights to your teams.

Wether you have a small team looking for insights to quickly get into the right gear to support a massive usage growth, or a medium sized one trying to tackle growth pains between software engineers and infrastructure : we can help.

We are based in France, EU and especially happy to respond to customers from Denmark, Estonia, Finland, France, Italy, Netherlands, Norway, Spain, and Sweden.

We can provide training, general consulting on infrastructure and design, and software engineering remotely or in house depending on location and length of contract.

Contact us to talk about what we can do : sales@imfiny.com.