Splitting comma seperated fields into multiple rows in SQLite

I have been using SQLite occasionally to work with large (>100GB) datasets. Recently I had to split out a field that had comma seperated values into multiple rows. My SQL and database skills is not very mature, but I am learning new things all the time (eg. don’t forget to index columns that you need), so its been a fun addition to my toolkit.

In this particular case, I was working on multi-label classification problem (each example can have more than one label). The labels were stored as a comma seperated values in the same column. I needed to get (id, label) pairs out from the database. Postgres has a convenient method regexp_split_to_table which can be used to split the column using commas. SQLite (being lite), does not have such a function though so we got to implement it ourselves.

Lets take a look at how the data looks like

sqlite> SELECT eid,labels FROM examples;
e0000000|label1,label2,label3
e0000001|label1
e0000002|label2,label4

Since we need a query that can run on an unknown number of labels per row, we can use a Common Table Expression (CTE) to create a recursive query that splits the strings one substring at a time.

-- query.sql
WITH RECURSIVE split(eid, label, str) AS (
    SELECT eid, '', labels||',' FROM examples
    UNION ALL SELECT
    eid,
    substr(str, 0, instr(str, ',')),
    substr(str, instr(str, ',')+1)
    FROM split WHERE str!=''
) 
SELECT eid, label
FROM split
WHERE label!='';

Each time a string is passed into the above query, it splits the string at the first comma it finds. It returns the left part of the string as a label and the right part of the string is passed into the next iteration of the recursive query to be processed again.

Given label1,label2,label3, the first iteration of the CTE will return label1 and run the recursive CTE on label2,label3 and so on. The rest of the code is handling minor things such as ensuring there is a comma after every label.

$ sqlite3 data.db < query.sql
e0000000|label1
e0000000|label2
e0000000|label3
e0000001|label1
e0000002|label2
e0000002|label4

The query produces exactly what we want. We can now run hyperparameter tuning to overfit on the validation set!

CTE’s are really powerful and are extensively used to traverse graphs (yes you heard that right, you dont need graph databases to store graphs). Read more about them from the SQLite documentation.