Learning About Streams in Elixir
Note: This post was originally called "Why not use Elixir Streams all the time?" Halfway through writing that post, I started googling the question and realized that others had already dug into it, and I quickly found a variety of good answers. That said, I'll leave this post up, as it was fun to write (and it might be fun to read if you enjoy seeing someone think out loud on paper about new topics)!
Back already with another post about Elixir? Yes. Yes I am.
Today I'm asking myself the question: any time we transform a list across multiple Enum functions, why shouldn't we use Streams if they're supposedly (on paper at least) ~~faster~~ more memory efficient / reduce the number of iterations?
For those who are already more familiar with Elixir than myself (I'm still learning!), the following TL;DR might suffice based on what I gathered over the course of writing this post (and hey, reach out if I'm missing something!):
- Using streams doesn't automatically mean your list transformation pipeline is going to be faster or more memory efficient
- Streams seem generally recommended for really big or unknown data sources
- You could use a stream to read a huge file but exit once you find the first instance of something you want in that file (better than reading the whole thing into memory only to find what you need on the third line ;) )
- Streams let you start working right away with large data sources, rather than waiting for them to be opened into memory.
- You could also use streams to reduce multiple iterations of the transformation of a list (but arguably, this might sacrifice some readability).
- You might wish to look into streams only once you are actually finding bottlenecks in your application (don't optimize until you need to optimize :) ).
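As a quick sketch of that early-exit idea (the file name and predicate here are just for illustration):

```elixir
# Lazily read the file line by line; Enum.find/2 stops consuming the
# stream as soon as the predicate returns a truthy value, so the rest
# of the file is never read into memory.
"words.txt"
|> File.stream!()
|> Enum.find(fn line -> String.starts_with?(line, "zeb") end)
```

With `File.read!/1` instead, the entire file would be loaded before the search even began.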
Some Backstory - What are streams?
I received a hard copy of Elixir in Action (EiA) thanks to my work and have dived right into it. It is so nice to read and learn about programming without having a screen in front of my eyes! But that's a topic for another post.
In chapter three of EiA, we are introduced to various loops and iterations and, more pertinent to this post, Streams. I don't know much about streams/streaming; it's a concept that has always been a little foggy to me (and perhaps is sometimes an overloaded term?). When I think of "streaming", I think of reading a file bit by bit rather than opening it all at once. In EiA, we learn that in Elixir:
Streams are a special kind of enumerable that can be useful for doing lazy composable operations over anything enumerable. …
stream = [1, 2, 3] |> Stream.map(fn x -> 2 * x end)
Because a stream is a lazy enumerable, the iteration over the input list ([1, 2, 3]) and the corresponding transformation (multiplication by 2) haven't yet happened.
To make the iteration happen, you have to send the stream to an Enum function, such as each, map, or filter. You can also use the Enum.to_list/1 function, which converts any kind of enumerable into a list:
Enum.to_list(stream) # => [2, 4, 6]
- pp. 98-99 of EiA
Very cool! So perhaps a stream here is pretty analogous to a stream as I think I understand them. Just to be sure, why don't we take a peek at how Node.js defines streams - just to get some perspective from another language:
Using streams you read [a file, for example] piece by piece, processing its content without keeping it all in memory.
Why streams?
Streams basically provide two major advantages over using other data handling methods:
- Memory efficiency: you don't need to load large amounts of data in memory before you are able to process it
- Time efficiency: it takes way less time to start processing data, since you can start processing as soon as you have it, rather than waiting till the whole data payload is available
(Aside: one of the funny things about programming is that I can go years where I vaguely know what a term means, but until I actually encounter it head on as a tool I need/want to understand, it just lives in a foggy part of my head that I ignore! For the most part, I'm either avoiding something I think is going to be too complicated or I just don't need it as a solution for a thing I'm building.)
Ok, so the term "stream" definitely maps here - we're doing lazy evaluation in Elixir; a Stream function (Stream.filter, for example) provides a means of describing the computation without actually performing it.
Further, chapter 2 of Elixir in Action makes special note that most functions that operate on lists (in either the List or Enum module) are O(n) operations - so something like running length(my_list) will need to traverse the entirety of your list to figure out its length. This is something to be aware of if you are piping lists through a lot of Enum functions.
From Enum to Stream
It seems common in Elixir to come across code like this:
my_list
|> Enum.map(do_thing)
|> Enum.filter(remove_things_that_match_x)
|> Enum.with_index
Elixir in Action, if I'm understanding it correctly, posits that larger list transformations could perhaps be more efficient by using Streams:
my_list
|> Stream.map(do_thing)
|> Stream.filter(remove_things_that_match_x)
|> Stream.with_index()
|> Enum.to_list() # this is where stream iteration takes place
In the first example, Elixir iterates over the list three times: once to run the map lambda, once to run the filter lambda, and once to add the index to each entry. And so the question that kicked off this post came up: why not use streams all the time?
The subject is still a little foggy, but from the outside, Elixir's Streams are doing some magic behind the scenes (I would love to know how this works…) to make those three transformations happen in a single iteration. How? I have no clue. For now, let's move on to benchmarking.
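One way to see the single pass for yourself is to add a side effect to each step; when the stream is finally consumed, each element flows through every function before the next element is touched (a small demo, not how the internals are actually implemented):

```elixir
[1, 2, 3]
|> Stream.map(fn x ->
  IO.puts("map: #{x}")
  x * 2
end)
|> Stream.filter(fn x ->
  IO.puts("filter: #{x}")
  x > 2
end)
|> Enum.to_list()
# Prints "map: 1", "filter: 2", "map: 2", "filter: 4", "map: 3",
# "filter: 6" - interleaved per element - and returns [4, 6]
```

With Enum in place of Stream, all of the "map:" lines would print before any of the "filter:" lines, because each Enum call builds a whole intermediate list.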
Naive Benchmarking
I tried wiring up my own little script to do some naive benchmarking. First, I grabbed a 4 MB+ plain-text file that holds a list of words from the English dictionary. I wrote two contrived pipelines, one using Enum and one using Stream, then found a library called Benchee to do the benchmarking.
defmodule Streaming do
def large_lines_stream!(path) do
File.stream!(path)
|> Stream.with_index()
|> Stream.filter(fn {line, _index} -> String.length(line) > 25 end)
|> map_it
end
def large_lines_enum!(path) do
File.stream!(path)
|> Enum.with_index()
|> Enum.filter(fn {line, _index} -> String.length(line) > 25 end)
|> map_it
end
defp map_it(lst) do
Enum.map(lst, fn {line, index} ->
if String.starts_with?(line, "a") do
{"#{line} <<<<<", index}
else
{line, index}
end
end)
end
end
Benchee.run(%{
  "large_lines_stream" => fn -> Streaming.large_lines_stream!("./words.txt") end,
  "large_lines_enum" => fn -> Streaming.large_lines_enum!("./words.txt") end
})
Running the above spat out some nice results:
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.13.4
Erlang 24.3.3
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking large_lines_enum ...
Benchmarking large_lines_stream ...
Name ips average deviation median 99th %
large_lines_stream 3.91 255.56 ms ±1.40% 254.61 ms 264.56 ms
large_lines_enum 3.44 290.34 ms ±9.95% 285.68 ms 379.26 ms
Comparison:
large_lines_stream 3.91
large_lines_enum 3.44 - 1.14x slower +34.78 ms
Here, Stream comes out about 1.14x faster than Enum. I tried 10x-ing the file size by copy-pasting the dictionary entries until I had about a 40 MB file. The results weren't much different: it ran a few milliseconds longer and had about the same percent difference between stream and enum.
Granted, I don't have much experience with Streams. I'm not really sure how big a file has to be before it should be approached with the Stream
module.
I think I've answered my own question (and have done some googling to see what other folks are saying). I'm not going to start just using Stream in list transformations unless I'm actually operating on data that either has an entirely unknown size, is "infinite", or is just really… really… big.
I feel like I took the long way around this question of mine, but I'm glad I did. I had fun writing it and learned about benchmarking along the way, too.
Additional Edits
I received some feedback on this post clarifying that streams are not necessarily faster but are more memory efficient. The feedback suggested that I also benchmark memory usage - I hadn't noticed that benchee could do this (via its memory_time option)!
Here are some new results, running the same functions as above but this time on a 400 MB file I generated with another function.
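(I won't reproduce the exact generator here, but a sketch along these lines would do it - the module name, paths, and repeat count are all just illustrative:

```elixir
defmodule BigFile do
  # Sketch: blow a word list up to a few hundred MB by streaming it
  # into a destination file over and over, without ever holding the
  # whole thing in memory.
  def generate(src, dest, copies) do
    File.rm(dest)

    for _ <- 1..copies do
      src
      |> File.stream!()
      |> Stream.into(File.stream!(dest, [:append]))
      |> Stream.run()
    end

    :ok
  end
end

# BigFile.generate("./words.txt", "./big_words.txt", 100)
```

)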
❯ mix run lib/streams.exs
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.13.4
Erlang 24.3.3
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 18 s
Benchmarking large_lines_enum ...
Benchmarking large_lines_stream ...
Name ips average deviation median 99th %
large_lines_stream 0.0224 44.56 s ±0.00% 44.56 s 44.56 s
large_lines_enum 0.0216 46.25 s ±0.00% 46.25 s 46.25 s
Comparison:
large_lines_stream 0.0224
large_lines_enum 0.0216 - 1.04x slower +1.69 s
Memory usage statistics:
Name Memory usage
large_lines_stream 37.45 GB
large_lines_enum 37.45 GB - 1.00x memory usage -0.00000 GB
**All measurements for memory usage were the same**
tees/elixir-in-action/streams is 📦 v0.1.0 via 💧 v1.13.4 (OTP 24) took 6m39s
Something definitely went wrong! I think I have more to learn about benchmarking and streams before I continue. I think I'll pause here and perhaps come back to this someday!
Thanks for reading o/
WT
Additional Readings
- Should one use Streams as much as possible? : r/elixir
- Stream vs Enum benchmark : r/elixir
- performance - Should I ever prefer Enum to Stream in Elixir?
I posted this article here and got some great feedback and helpful corrections.