Is `Array` the only option for storing elements in Ruby?

TLDR; Array, Set and SortedSet reflect various data structures created for different purposes. It’s good to know their pros and cons and, based on them, be able to select the one which fulfils concrete requirements best. Sometimes there is almost no difference, but there are cases where Set & SortedSet are much faster (and more suitable) than Array and vice versa.

Array class is one of the most commonly used Ruby class in day-to-day development. Enumerable module adds a comprehensive set of methods which makes playing with arrays a pleasant activity. I have just written set deliberately.

Next to the Array class there is also less known Set, which to be honest, I have not used by myself until today. In this article, I would like to share with you my findings about the classes based on a short example 🙂

Basic requirements and implementation

Let’s assume that we are building a brand new Ruby application (without any specific web framework). One of the requirements defined as a user story is written below:

As a user I want to be able to select a language in which content is displayed so that my friends from other countries can understand it as well.

To fulfil it we need to store all supported languages1 somewhere. There are multiple approaches to this problem, but for the sake of simplicity let’s agree that we are going to define Language class. It will be responsible for all the logic related to languages supported by the application.

class Language
end

To keep a list of the supported languages we can define a public class method called all. For the beginning, let’s assume that en is the only language we have content in.

class Language
  def self.all
      [:en] # It could be string as well.
  end
end

The method returns all supported languages:

Language.all
=> [:en]

My brain intuitively prompted me, that Array is a suitable data structure for that case in Ruby world.

Language.all.class
=> Array

Over time, as the application has been growing, other developers have extended the list with new elements:

class Language
  def self.all
    [:en, :de, :fr, :pl, :ro, :it, :nl]
  end
end

At this point I can see two potential problems:

  1. The array is defined in one line, what makes spotting a bug (e.g. a duplicated record) during code review process harder.
  2. The array is not sorted alphabetically. Therefore, it is easy to add the same language twice.

Let’s apply small changes:

class Language
  def self.all
    [
      :de,
      :en,
      :fr,
      :it,
      :nl,
      :pl,
      :ro,
    ]
  end
end

There are more lines of codes, but we are prepared better to avoid the potential issues. To gain even more confidence, we can take usage of sort and uniq methods defined in Enumerable mixin:

class Language
  def self.all
    [
      :de,
      :en,
      :fr,
      :it,
      :nl,
      :ro,
      :pl, # yiks, we added an unsorted element
      :de, # yiks, we added a duplicated element
    ].sort.uniq
  end
end

Thanks to it, the method always returns a sorted array of supported languages without duplicates. Are there any downsides of the above code? There are:

  1. sort & uniq methods return new array every time, so three objects are created instead of one.
  2. We did not freeze the array, so new elements (languages) can be pushed to it. It may lead to bugs.
languages = Language.all
=> [:de, :en, :fr, :it, :nl, :pl, :ro]

languages.push(:my_evil_language)
=> [:de, :en, :fr, :it, :nl, :pl, :ro, :my_evil_language]

Let’s try to address the above issues:

class Language
  def self.all
    [
      :de,
      :en,
      :fr,
      :it,
      :nl,
      :pl,
      :ro,
    ].uniq.sort!.freeze
  end
end

The code removes duplicates, sorts and freezes the array finally. Order of operations is important here. If we froze it first, we wouldn’t be able to modify it anymore:

[:de, :en, :pl].freeze.sort!
FrozenError (can't modify frozen Array)

Let’s check if our refactored method works as expected:

languages = Language.all
=> [:de, :en, :pl]

languages.push(:my_evil_language)
FrozenError (can't modify frozen Array)

At this point, we could stop polishing the method and it would fulfil the defined requirements well enough. But that is not the aim of this article 🙂

Say hi! to Set & SortedSet classes

Time for some fancy definition:

In mathematics, a set is a collection of distinct objects, considered as an object in its own right.2

The above sentence is true also in Ruby. Simply speaking, a set is an unordered array without duplicates.

Set[:de, :en]
=> #<Set: {:de, :en}>

Set[:de, :en, :de, :en] # duplicates are ignored
=> #<Set: {:de, :en}>

[:de, :en].to_set # Array has to_set method defined
=> #<Set: {:de, :en}>

Set.new([:de, :en])
=> #<Set: {:de, :en}>

The above snippet shows different ways of creating new Set objects. Furthermore, similar to Array class, Enumerable module is one of his ancestors:

Array.ancestors
=> [Array, Enumerable, Object, Kernel, BasicObject]

Set.ancestors
=> [Set, Enumerable, Object, Kernel, BasicObject

It means that we can use all the handy collection methods on Set objects:

set = Set.new([:en, :de])
=> #<Set: {:en, :de}>

sorted_set = set.sort
=> [:de, :en]

But wait, did you notice a subtle difference?

set.class
=> Set

sorted_set.class
=> Array

You may ask Why? The answer can be found in Ruby documentation:

Set implements a collection of unordered values with no duplicates.3

Unique values without order. As simple as that. Calling any method that breaks this rule returns Array object instead.

Applying Set to Language class

require 'set'

class Language
  def self.all
    Set[
      :de,
      :en,
      :fr,
      :it,
      :nl,
      :pl,
      :ro,
    ].sort.freeze
  end
end

Note, that for a set we need to use sort method, as sort! one is defined inside Array class only.

Does the above make the code any better? From one point we used a data structure created specially for storing unique values. From another, we ends with an array anyway and create two objects (one set and one array).

Can we do any better? SortedSet to the rescue! I literally discovered it when I was writing this article.

SortedSet implements a Set that guarantees that its elements are yielded in sorted order when iterating over them.4

SortedSet.ancestors
=> [SortedSet, Set, Enumerable, Object, Kernel, BasicObject]

A subclass of Set with elements in order, let’s try that.

require 'set'

class Language
  def self.all
    SortedSet[
      :en,
      :fr,
      :it,
      :nl,
      :pl,
      :ro,
      :de # an unordered element
    ].freeze
  end
end
languages = Language.all
=> #<SortedSet: {:de, :en, :fr, :it, :nl, :pl, :ro}>

languages.add(:my_evil_angiage)
FrozenError (can't modify frozen SortedSet)

Thanks to SortedSet we created only one object representing a class created for storying ordered elements with no duplicates. Voila 🙂

Summary

We can achieve (almost) the same result using Array, Set and SortedSet class. They are different data structures created for various reasons, though. A concrete class should be selected based on well-defined requirements. To make the final choice we need to be aware of their strengths and weaknesses. Asking yourself some questions helps as well:

  • How many elements will the data structure hold?
  • Will all the elements be returned every time or rather only one/a few based on some criteria (like checking if a language is supported by the application)?
  • and so on.

In the end, we might find that for one case it won’t make any difference (like holding only 10 static values), when for another a difference may be significant (like checking if an element is present in a data structure by using include? method).

If you have doubts, it’s always a good idea to do some benchmarking veryfing your assumptions:

require 'benchmark/ips' 
require 'set' 

elements = (1..100_000).to_a
array = [*elements].freeze
set = Set[*elements].freeze

Benchmark.ips do |x| 
  random_element = elements.sample
  x.report("Set.include?")   { 10000.times { set.include?(random_element) } }
  x.report("Array.include?") { 10000.times { array.include?(random_element) } } 
end
Warming up --------------------------------------
         Set.include?   109.000  i/100ms
       Array.include?     1.000  i/100ms
 Calculating -------------------------------------
         Set.include?      1.071k (±12.0%) i/s -      5.341k in   5.077456s
       Array.include?      0.214  (± 0.0%) i/s -      2.000  in   9.343202s

The above example clearly shows that searching for an element using include? method is much more effective on Set objects. Why? Because it uses Hash behind the scene which is faster for such cases than Array in Ruby. But that’s topic for another article 🙂

Treat Array, Set and SortedSet classes like specialised tools in your toolset. You should be aware that all three exist, know their limitations, and use them to solve your problems effectively 🙂

Footnotes

  1. Let’s ignore differences between locales and languages.
  2. https://en.wikipedia.org/wiki/Set_(mathematics)
  3. http://ruby-doc.org/stdlib-2.5.3/libdoc/set/rdoc/Set.html
  4. https://ruby-doc.org/stdlib-2.5.3/libdoc/set/rdoc/SortedSet.html
 

Igor Springer

I build web apps. From time to time I put my thoughts on paper. I hope that some of them will be valuable for you. To teach is to learn twice.

 

Leave a Reply

Your email address will not be published. Required fields are marked *