What's in this column?
When you work with database tables, it often happens that you look at a column name and wonder “What’s in there?”. One would hope that every column is documented so that the answer is obvious, but in real life that is all too rarely the case, and even when documentation exists it may be wrong.
Level 1: Select
The very first thing you might do is a basic SELECT my_column FROM my_table in your favourite DB tool such as DBeaver or pgAdmin. You will then see some values, 100-200 or more depending on the default query limit of your tool. If the column is something like title or description, you may notice that the values are nearly unique and have clear semantics, and you can call it a day at this point. However, maybe the column is an obscure URID_v2 column and you get back only a couple of values repeated over and over again, maybe even a single dreadful ACC_43 or something. Clearly that query is not enough.
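As a minimal sketch (my_table and my_column are placeholders), an explicit LIMIT keeps the peek cheap even if your tool does not apply one automatically:

```sql
-- Glance at a handful of values without pulling the whole table.
SELECT my_column
FROM my_table
LIMIT 100;
```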
Level 2: Select Distinct
Let’s now step it up: is ACC_43 the only value, or are there more? You try a SELECT DISTINCT my_column FROM my_table. If you are lucky, you now get multiple unique values instead of a single repeated one. Maybe you can make sense of them and finally understand the meaning of that column… or maybe not. Maybe you are even more confused because some values make sense and some don’t, and you would like to correlate them with their frequency or with other columns (in that case you can do a COUNT(*) ... GROUP BY my_column). Or maybe you get nothing at all: you may have triggered a very expensive query without realizing it, and now all you see is a spinner. Even if the column holds only a few distinct values, unless it is indexed in a way that allows an index-only scan, Postgres has to do a full sequential scan of the whole table to find your unique values, and on a large table this can be far too expensive.
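The frequency query mentioned above could look like this (table and column names are placeholders; note that it still scans the whole table, just like the DISTINCT):

```sql
-- Count how often each distinct value appears, most frequent first.
SELECT my_column, count(*) AS n
FROM my_table
GROUP BY my_column
ORDER BY n DESC
LIMIT 50;
```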
Level 3: Statistics
If your table is big and your distinct query is spinning around, there is still a way to get a good idea of the content of a table by looking at the column statistics:
SELECT
  mcv.val AS val,
  mcv.freq AS freq
FROM
  pg_stats s
  CROSS JOIN LATERAL unnest(s.most_common_vals::text::text[], s.most_common_freqs) AS mcv(val, freq)
WHERE
  s.tablename = 'my_table' AND
  s.attname = 'my_column';
Postgres automatically gathers statistics about the distribution of values in each column. Like all metadata, they can be queried through the built-in catalog tables and views, in this case pg_stats. By default, Postgres keeps track of the 100 most common values in a column and their frequencies. This can be configured either globally through default_statistics_target or on a per-column basis with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS. Note that changing those settings does not immediately gather statistics; they are only refreshed by the next analyze run (either automatic or manual with ANALYZE). Either way, changing that setting is NOT the way to go if you just want to know what’s in a column and are not satisfied with a top 100 of values! Please don’t do that!
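For the record, the per-column version looks like this (my_table and my_column are placeholders, and again: do this for query planning reasons, not just to explore a column):

```sql
-- Track up to 500 most common values / histogram buckets for this column.
ALTER TABLE my_table ALTER COLUMN my_column SET STATISTICS 500;
-- The new target only takes effect at the next analyze run:
ANALYZE my_table;
```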
Level 4: Random Sample
What if your table is very large, you want more than 100 values, and a SELECT DISTINCT ... LIMIT x is not good enough because you want an idea of the distribution of values? You can use a random sample! Postgres is nice enough to provide the perfect tool for that: the TABLESAMPLE clause. Let’s imagine we have a very, very big table and want to get a better idea of the value distribution.
WITH column_sample AS (
  SELECT my_column AS val
  FROM my_table TABLESAMPLE SYSTEM (1) -- a random sample of approximately 1% of the rows
)
SELECT
  val,
  count(*)::float / (SELECT count(*) FROM column_sample)::float AS freq
FROM column_sample
GROUP BY val
ORDER BY freq DESC;
This query will execute nearly 100x faster than a full “group by” query on the whole table. If the table is big enough, sampling 1% of it randomly should already give a reasonably accurate idea of the distribution of values. Depending on the query time, the table size and how many values you want, you can adjust the sampling percentage. If you have 10 billion rows, 0.001% could be enough!
Note that the SYSTEM sampling method makes the query very fast but affects the randomness: Postgres does not in fact select 1% of the rows at random, but rather randomly selects 1% of the table’s pages. If your rows are very small (and thus there are many rows per page) and the table is not that big, you may end up with a fairly biased sample. On the other hand, if that were the case, you wouldn’t really need TABLESAMPLE to begin with. This method mostly becomes necessary once the table exceeds 100GB or so, at which point 1% already represents hundreds of thousands of pages, so the tiny bias should not be an issue.
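If the page-level bias does worry you, Postgres also ships a BERNOULLI sampling method that picks individual rows instead of whole pages. It gives better randomness, but it still visits every page, so it is much slower than SYSTEM on a huge table:

```sql
-- Row-level sampling: ~1% of rows, chosen independently.
SELECT my_column
FROM my_table TABLESAMPLE BERNOULLI (1);
-- Add REPEATABLE (42) after the percentage for a reproducible sample.
```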