<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 13:09:59 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Best Practices for Picking PostgreSQL Data Types]]></title>
            <description><![CDATA[Learn which data types best suit your application when storing massive data volumes in PostgreSQL and TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/best-practices-for-picking-postgresql-data-types</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/best-practices-for-picking-postgresql-data-types</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <dc:creator><![CDATA[Chris Engelbert]]></dc:creator>
            <pubDate>Wed, 23 Aug 2023 15:43:17 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/08/PostgreSQL-Data-types.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>A good, future-proof data model is one of the most challenging problems when building applications. That is especially true for applications meant to store (and analyze) massive amounts of data, such as time series, logs, or events.</p><p>Deciding which data types are best suited to store that kind of information comes down to a few factors, such as the required precision of floating-point values, the actual content of the values (such as text), compressibility, and query speed.</p><p>In this installment of the best practices series (see our posts on <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">narrow, medium, and wide table layouts</a>, <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables/">single or partitioned hypertables</a>, and <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-metadata-tables/" rel="noreferrer">metadata tables</a>), we’ll have a look at the options PostgreSQL and TimescaleDB offer for most of these questions. While this post can’t answer the precision requirements of your application for you, it does provide a few alternatives (such as integers instead of floating-point values)—but more on that later.</p><h1 id="before-we-start-compression">Before We Start: Compression</h1><p>Event-like data, such as time series and logs, is notorious for the ever-growing amount of collected information. It accumulates continuously and requires more and more disk storage.</p><p>But that’s not the only issue with big data. Querying, aggregating, and analyzing are some of the others. 
Reading this amount of data from disk requires a lot of I/O operations (<a href="https://en.wikipedia.org/wiki/IOPS">IOPS; input/output operations per second</a>), which is one of the most limiting factors in cloud environments and even on-premise systems (due to how storage works in general). While non-volatile memory express (NVMe) transfer protocols and similar technologies can help you optimize for high IOPS, they’re not limitless.</p><p>That’s where compression comes in. TimescaleDB’s compression algorithms (and, to some extent, PostgreSQL’s default ones) help decrease disk space requirements and IOPS, improving cost, manageability, and query speed.</p><p>But let’s get to the actual topic: best practices for data types in TimescaleDB.</p><h2 id="basic-data-types">Basic Data Types</h2><p>PostgreSQL (and the SQL standard in general) offers a great set of basic data types covering all general use cases. However, some of them should be avoided. The <a href="https://wiki.postgresql.org/wiki/Don't_Do_This">PostgreSQL Wiki</a> provides a great list of best practices regarding which data types to use or avoid. You don’t have to jump over there right away, though; we’ll cover most of them here 🥹.</p><h3 id="nullable-columns">Nullable columns</h3><p>Looking back at the <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">best practices on table layout</a>, medium and wide table layouts tend to have quite a few nullable columns serving as placeholders for potential values.</p><p>Due to how PostgreSQL stores <code>NULL</code> values, those are almost free. Having hundreds of nullable columns, most being <code>NULL</code>, is not an issue. The same is true for TimescaleDB’s custom compression. 
Since compressed data is stored in a columnar format, empty row values are almost free when compressed (thanks to a null bitmap).</p><h3 id="boolean-values">Boolean values</h3><p>A <em>boolean</em> value is a logical data type with one of two possible values, <code>TRUE</code> or <code>FALSE</code>. It is normally used to record decisions or states.</p><p>There isn’t much specific to booleans in TimescaleDB. They are a very simple data type and a great choice. Still, people often use an <em>integer</em> to represent their values as <code>1</code> or <code>0</code>. That may come in handy with narrow or medium table layouts where you want to limit the number of columns. Either solution works fine!</p><p>In terms of compressibility, booleans aren’t heavily optimized but compress fairly well with the standard compression. If you have a series of states, though, consider storing only state changes, removing consecutive duplicates from the dataset.</p><h2 id="floating-point-values">Floating-point values</h2><p>Floating-point data types represent real numbers. They are used to store all kinds of information, such as percentages, measurements like temperature or CPU usage, or statistical values.</p><p>There are two groups of floating-point numbers in PostgreSQL: the binary types <em>float4</em> (a.k.a. <em>real</em>) and <em>float8</em> (<em>double precision</em>), and the arbitrary-precision type <em>numeric</em>.</p><p><strong><em>Float4</em> and <em>float8</em> columns are the recommended data types.</strong> TimescaleDB handles them specifically (during compression) and optimizes their use. On the other hand, <em>numeric</em>, as an arbitrary-precision data type, isn’t optimized at all. <strong><em>Numeric</em> isn’t recommended.</strong></p><p>In general, though, due to the complexity of floating-point numbers, if you know the required precision upfront, you can use the multiply-division trick and store the values as integers, which are better optimized. 
For example, say we want to store a temperature value (in Kelvin) with only two decimal places, but the value comes in as a <em>float4</em>.</p><pre><code class="language-sql">-- store: shift two decimal places into the integer
SELECT (round(298.151566 * 100))::int;   -- 29815
-- query: shift back when reading
SELECT 29815::float4 / 100;              -- yields 298.15 again
</code></pre>
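<p>Applied to a table, the same trick might look like the following sketch (table and column names are purely illustrative):</p><pre><code class="language-sql">-- temperature stored as hundredths of a Kelvin (integer)
CREATE TABLE sensor_data (
    time             timestamptz NOT NULL,
    device_id        int         NOT NULL,
    temp_centikelvin int         NOT NULL
);

INSERT INTO sensor_data (time, device_id, temp_centikelvin)
VALUES (now(), 1, (round(298.151566 * 100))::int);

-- convert back to a float only when reading
SELECT time, device_id, temp_centikelvin::float4 / 100 AS temp_kelvin
FROM sensor_data;
</code></pre>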
<p>This trick is often used in data transmission for embedded devices with a low-throughput uplink to limit the number of bytes sent.</p><h3 id="integer-values">Integer values</h3><p>Integer data types represent whole numbers of various sizes (and therefore valid value ranges, depending on how many bytes are used to represent them). Integer values are often used for simple values, such as counts of events.</p><p><strong>All integer types (<em>int2, SmallInt</em>, <em>int4, Integer</em>, <em>int8, BigInt</em>) are recommended data types.</strong> TimescaleDB is heavily optimized to compress the values of those data types. No fewer than three compression algorithms work in tandem to get the most out of them. </p><p>This is why it is advised (if you know the necessary precision of a floating-point value) to store the values as integers (see <a href="#floating-point-values">floating-point values</a> for more information).<br><br>What is true for integers is also true for all serial data types (<em>serial2, SmallSerial</em>, <em>serial4, Serial</em>, <em>serial8, BigSerial</em>), as those are magical “aliases” for their integer counterparts that additionally create sequences to fill in their values on insert. </p><p>That said, they use their corresponding integer data types as the column data type. The PostgreSQL best practices advise against using them, though, and recommend <a href="https://www.postgresql.org/docs/current/sql-createtable.html">identity columns</a> instead for anything from PostgreSQL 10 onwards.</p><h3 id="timestamp-time-and-date-values">Timestamp, time, and date values</h3><p>Timestamp, time, and date data types represent a specific point in time, some with more and some with less explicit information. 
All these data types have versions with and without time zone information attached (except <em>date</em>).</p><p><strong>Before going into details, I’d generally advise against any of the data types without time zones (timestamp without time zone, timestamp, time without time zone, time).</strong> Most of them are discouraged by PostgreSQL’s best practices and shouldn’t be used. </p><p>Contrary to popular belief, leaving out the time zone doesn’t save any storage space, and TimescaleDB doesn’t have any optimization for data types without time zones. While it works, it adds a lot of implicit casting overhead that is not immediately visible. That said, just don’t do it 🔥.</p><p>With that out of the way, you can use <em>date</em> and <em>time</em>, but you should consider the use case. While <em>date</em> is optimized and compressed using the same compression scheme as integers, <em>time</em> is not. In any case, you shouldn’t use either for the time-dimension column. </p><p>To store dates, you should consider using <em>timestamptz</em> with the time portion set to midnight in the necessary timezone (<code>2023-01-01T00:00:00+00</code>) to prevent any casting overhead when querying.</p><p>Likewise, you can use <em>timestamptz</em> to store a time value only. In this case, you encode the time portion onto a fixed date (such as <code>1970-01-01T15:01:44+04</code>) and cast the final value back into a <em>time</em> value. Alternatively, you can store the value as an <em>integer</em> by encoding the time into the (nano)seconds since midnight or any other encoding you can come up with.</p><p><strong>That leaves us with <em>timestamptz</em> (<em>timestamp with time zone</em>). You’ve guessed it: this is the recommended data type for any kind of point-in-time storage.</strong> </p><p>This data type is highly optimized, used by all internal functionality, and employs the same compression as integers. 
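</p><p>The date-at-midnight and encoded-time tricks mentioned above might look like this in SQL (a sketch):</p><pre><code class="language-sql">-- a date stored as timestamptz at midnight UTC
SELECT '2023-01-01T00:00:00+00'::timestamptz;

-- a time-of-day stored on a fixed date, cast back to time on read
SELECT ('1970-01-01T15:01:44+00'::timestamptz AT TIME ZONE 'UTC')::time;  -- 15:01:44

-- alternatively, a time-of-day encoded as seconds since midnight
SELECT 15 * 3600 + 1 * 60 + 44;  -- 54104
</code></pre><p>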
Said compression is especially effective with timestamps, as consecutive values tend to differ only slightly and therefore compress extremely well. </p><p>Still, be aware that some frameworks, object-relational mappings (ORMs), or tools love their <em>timestamp without time zone</em> and need to be forced to be good citizens.</p><h3 id="text-values">Text values</h3><p>Text values are used to store textual values of arbitrary size. Those values can include detailed descriptions, log messages, and metric names or tags. The available data types include <em>text</em>, <em>char(n)</em>, and <em>varchar(n)</em>.</p><p>PostgreSQL’s best practices advise against using <em>char(n)</em>, as it will pad values shorter than <em>n</em> to that size and waste storage. It recommends using <em>text</em> instead.</p><p>The same is true for <em>varchar(n)</em> with a length limit. Consider using <em>varchar</em> (without length limit) or <em>text</em>.</p><p>From a TimescaleDB-specific perspective, there isn’t much to say, except that you may want to deduplicate long values: keep the actual value in a separate table together with a reference (such as a checksum of the content), and store only the reference in the hypertable.<br><br>TimescaleDB doesn’t offer any specific optimization to handle this type of data. It will, however, apply dictionary compression (lz4-based) to those text fields.</p><h3 id="byte-array-bytea-values">Byte array (bytea) values</h3><p>Byte arrays (in PostgreSQL represented by the data type <em>bytea</em>) store arbitrarily large sequences of bytes, which may represent anything from encoded machine state to binary data packets.</p><p>Across customer and user use cases, it is a very uncommon data type, as most data is decoded before being stored in the database. Therefore, TimescaleDB doesn’t optimize anything about this data type. 
Compression, however, should still use the lz4-based algorithm.</p><p>If you have recurring, large <em>bytea</em> values, you can store them outside the actual hypertable and apply deduplication, as explained for text columns.</p><h2 id="complex-and-extension-data-types">Complex and Extension Data Types</h2><p>Compared to basic data types, complex data types commonly encode multiple values into a single column. This may include custom composite types.</p><h3 id="structural-types-json-jsonb-xml">Structural types (JSON, JSONB, XML)</h3><p>Structural data types encode complete objects or sets of information, often <a href="https://www.timescale.com/learn/what-is-data-compression-and-how-does-it-work">lossy</a> in terms of the data types of the actual values. PostgreSQL supports three structural data types: <em>JSON</em>, <em>JSONB</em> (a binary representation of JSON), and <em>XML</em>. </p><p>Values often contain complex state information from machines or sensors. They are also common with narrow table layouts to compensate for the different values that need to be stored.</p><p><strong>To get it out of the way, if you want to store JSON-like data, don’t use <em>JSON</em>—use <em>JSONB</em>!</strong> It’s better regarding storage space, query speed, and anything else you can think of. </p><p>The only disadvantage of <em>JSONB</em> is that you lose the original order of properties since <em>JSONB</em> will decompose the object for better efficiency due to the way it is stored internally. Anyhow, I can’t come up with a great reason why the order should matter, and I hope you agree 😉.</p><p><strong>That said, <em>JSONB</em> is a valid choice when storing complex data.</strong> The amount of stored data should be kept under close observation, though. If you have recurring, large amounts of data inside the <em>JSONB</em> objects, it may be advisable to extract that meta information into a separate, vanilla PostgreSQL table and join them in at query time. 
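</p><p>A minimal sketch of that kind of deduplication might look like this (all table and column names are illustrative):</p><pre><code class="language-sql">-- the recurring, large values live in a vanilla PostgreSQL table
CREATE TABLE device_config (
    checksum text PRIMARY KEY,   -- e.g., a hash of the content
    config   jsonb NOT NULL
);

-- the (hyper)table only stores the reference
CREATE TABLE events (
    time            timestamptz NOT NULL,
    device_id       int         NOT NULL,
    config_checksum text        REFERENCES device_config (checksum)
);

-- join the metadata back in at query time
SELECT e.time, e.device_id, c.config
FROM events e
JOIN device_config c ON c.checksum = e.config_checksum;
</code></pre><p>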
An approach is shown in the <a href="#text-values">text values</a> section.</p><p>For <em>XML</em>, I don’t have enough data to give any recommendations. Actually, I cannot remember anyone ever asking anything about it. Due to my scarce experience, I wouldn’t advise using it. I guess it may behave similarly to <em>JSONB</em>, but that’s reading tea leaves.</p><p>For all the above data types, TimescaleDB won’t apply any specific optimizations, and compression will likely use the dictionary algorithm, which still yields impressive results (been there, done that).</p><p>There are a few things to be wary of when using those data types. Remember that the whole object needs to be read and parsed before the requested path’s value can be extracted. That happens for every single row of data being queried. </p><p>One thing I learned using JSONB is that <a href="https://www.postgresql.org/docs/current/indexes-expressional.html">indexes on expressions</a> and <a href="https://www.postgresql.org/docs/current/gin-intro.html">GIN indexes</a> are your friends if you need to make selection decisions based on data inside the object (such as tags). Ensure that anything you need for that decision is in an index, so the object only has to be read and (potentially) decompressed for matching rows.</p><p>The second element to remember is that extracting values from those objects yields text values that need to be cast into the required data type. Even if the value is stored as a number inside the <em>JSONB</em> object, extracting it returns the text representation, which then needs to be cast back into an integer or floating-point value, adding a lot of casting overhead to the process.</p><h2 id="universally-unique-identifiers-uuid">Universally Unique Identifiers: UUID</h2><p>Using <em>UUID</em> to store unique identifiers is common in the PostgreSQL world. 
They are used for various reasons, one being the ability to generate identifiers that are highly unlikely to collide on a system other than the database, often for “security” reasons (i.e., avoiding potentially guessable series of numbers). </p><p>A <em>UUID</em> represents a (semi)random 128-bit number in the format specified in <a href="https://tools.ietf.org/html/rfc4122">RFC 4122</a> (ISO/IEC 9834-8:2005), which looks similar to <code>a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11</code>.<br><br>TimescaleDB doesn’t optimize this data type at all. It will compress it using the dictionary compression algorithm. However, there are other elements to keep in mind when using <em>UUID</em>. While UUIDs offer many advantages and simplify distributed systems, they also create some locality issues, specifically with BTree indexes. You will find a great read on this topic in the <a href="https://www.cybertec-postgresql.com/en/unexpected-downsides-of-uuid-keys-in-postgresql/">Cybertec blog</a>.<br><br>My recommendation would be to think twice about <a href="https://pgxn.org/dist/pg_uuidv7/">using <em>UUID</em>s</a>.</p><h2 id="tree-like-structures-ltree">Tree-like structures: Ltree</h2><p>While <em>ltree</em> is an uncommon data type when storing event-like data, it can be a great fit for deduplicating recurring, larger values. <em>Ltree</em> represents a certain (sub)path in a tree, such as a specific device in an overarching topology. Its values resemble a canonical, dot-separated path (<code>building1.floor3.room10.device12</code>).</p><p>Values of data type <em>ltree</em> aren’t handled by TimescaleDB specifically. However, due to their typically small size, this isn’t a big deal.</p><p>As mentioned above, <em>ltree</em> values are perfect for deduplicating recurring data, such as meta information that changes rarely or slowly. 
Combined with a hash/checksum, they can be used as a reference to look up values in an external table.</p><h2 id="key-value-pairs-hstore">Key-value pairs: <code>hstore</code></h2><p><em><code>hstore</code></em> provides the capability to store key-value pairs of data, similar to what a non-nested <em>JSON(B)</em> object would offer.</p><p>From my experience, only a few people use <em>hstore,</em> and most prefer <em>JSONB</em>. One reason may be that <em>hstore</em> only offers text values. That said, I have no data on compression gains or speed implications. I guess a lot of what was noted on <em>JSONB</em> objects would hold for <em>hstore</em>.</p><h2 id="postgis-data-geometries-and-geographies">PostGIS data: Geometries and geographies</h2><p>Last but not least, there are <a href="https://postgis.net/">PostGIS’</a> data types, such as <em>geometry</em> and <em>geography</em>. PostGIS data types are used to store spatial information of any kind. Typical use cases are GPS positions, street information, or positional changes over time.</p><p><a href="https://www.timescale.com/learn/postgresql-extensions-postgis">TimescaleDB works perfectly fine with PostGIS data types</a> but doesn’t optimize them. Compression ratios, while not perfect, are still decent.</p><p>Not a lot to mention here: they work, and they work great.</p><h2 id="what-is-the-recommendation-now">What Is the Recommendation Now?</h2><p>Recommending data types is hard since many decisions depend on requirements outside the scope of this blog post. However, I hope this read provides insight into which data types to avoid, which ones are fine, and how you may optimize the okay ones.</p><p>To summarize, here’s a small table showing different aspects and categorizing them into four states: great 🟢, okay, but be wary of something 🟠, avoid at any cost 🔴, as well as unknown (feel free to provide experience/feedback) ⚪️.</p><table>
<thead>
<tr>
<th style="text-align:right"><strong>Data Type</strong></th>
<th style="text-align:center"><strong>Query Speed</strong></th>
<th style="text-align:center"><strong>Compressibility</strong></th>
<th style="text-align:center"><strong>Recommended</strong></th>
<th><strong>Alternative</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right">boolean</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">real, float4</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">double (precision), float8</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">numeric</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🔴</td>
<td>floats, integers</td>
</tr>
<tr>
<td style="text-align:right">smallint, int2</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">integer, int4</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">bigint, int8</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">smallserial, serial2</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td>int2 with identity</td>
</tr>
<tr>
<td style="text-align:right">serial, serial4</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td>int4 with identity</td>
</tr>
<tr>
<td style="text-align:right">bigserial, serial8</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td>int8 with identity</td>
</tr>
<tr>
<td style="text-align:right">timestamp without time zone, timestamp</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">timestamp with time zone, timestamptz</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">time</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">timetz</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">date</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🔴</td>
<td>timestamptz</td>
</tr>
<tr>
<td style="text-align:right">char(n)</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🔴</td>
<td>text</td>
</tr>
<tr>
<td style="text-align:right">varchar(n)</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🔴</td>
<td>text</td>
</tr>
<tr>
<td style="text-align:right">text</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">bytea</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">json</td>
<td style="text-align:center">🔴</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🔴</td>
<td>jsonb</td>
</tr>
<tr>
<td style="text-align:right">jsonb</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">xml</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td>jsonb</td>
</tr>
<tr>
<td style="text-align:right">uuid</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟠</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">ltree</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
<tr>
<td style="text-align:right">hstore</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td style="text-align:center">⚪️</td>
<td>jsonb</td>
</tr>
<tr>
<td style="text-align:right">geometry, geography (PostGIS)</td>
<td style="text-align:center">🟢</td>
<td style="text-align:center">🟠</td>
<td style="text-align:center">🟢</td>
<td></td>
</tr>
</tbody>
</table>
<h2 id="conclusion">Conclusion</h2><p>If you want to know more about the compression algorithms used, see the blog post one of my colleagues wrote, <a href="https://timescale.ghost.io/blog/time-series-compression-algorithms-explained/">Time-Series Compression Algorithms, Explained</a>.</p><p>Also, if you have any feedback or experience with any of the above data types or the non-mentioned ones, I’d be happy to hear from you on <a href="https://twitter.com/noctarius2k">Twitter</a> or our <a href="https://slack.timescale.com">Community Slack</a> (@Chris Engelbert)!</p><p>Finally, I’m sorry for the wall of text you’ve had to read to get here, but there’s a lot of information to share and many different data types (and those aren’t even all of them). I hope you enjoyed the read and learned something new along the way.<br><br>If you want to test Timescale right now, the easiest and fastest way to get started is to sign up for our <a href="https://console.cloud.timescale.com/signup">30-day Timescale free trial</a>. To try self-managed TimescaleDB, see the <a href="https://docs.timescale.com/self-hosted/latest/">documentation</a> for further information.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Best Practices for Time-Series Data Modeling: Single or Multiple Partitioned Table(s) a.k.a. Hypertables]]></title>
            <description><![CDATA[Time-series data is relentless, so you know you’ll have to create one or more partitioned tables (a.k.a. Timescale hypertables) to store it. Learn how to choose the best data modeling option for your use case—single or multiple hypertables.]]></description>
            <link>https://www.tigerdata.com/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/best-practices-for-time-series-data-modeling-single-or-multiple-partitioned-table-s-a-k-a-hypertables</guid>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Hypertables]]></category>
            <dc:creator><![CDATA[Chris Engelbert]]></dc:creator>
            <pubDate>Thu, 09 Mar 2023 14:00:05 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2023/03/time-series-data-modeling-single-or-multiple-partitioned-tables_hero.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>Collecting <a href="https://timescale.ghost.io/blog/time-series-data/">time-related information</a>, or time-series data, creates massive amounts of data to manage and model. Storing it will require one or more Timescale hypertables, which are very similar to PostgreSQL partitioned tables.</p><p><a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/">Timescale hypertables work like regular PostgreSQL tables</a> but offer optimized performance and user experience for time-series data. With hypertables, data is stored in chunks, which work similarly to <a href="https://www.postgresql.org/docs/current/ddl-partitioning.html">PostgreSQL’s partitioned tables</a> but support multiple dimensions and other features. While we discussed the table layout in the <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/"><em>Narrow, Medium, or Wide Table Layout</em></a> best practices article, this time, we address whether you should use a single table to store all data or multiple tables, as well as the respective pros and cons.</p><p><a href="https://timescale.ghost.io/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">Timescale is built upon a relational database model</a>, which means it supports numerous data modeling choices, or ways in which data can be organized and laid out. Understanding the database design choices early on is crucial to finding the best combination.</p><p>I started using Timescale long before joining the company, initially using it to store IoT metrics at my own startups. We went through a few different iterations of designs, and migrations between those were anything but fun. 
Due to that personal experience, one of my biggest goals is to prevent others from suffering through the same.</p><h1 id="time-series-data-modeling-using-our-relational-database-experience">Time-Series Data Modeling: Using our Relational Database Experience</h1>
<p>As mentioned, Timescale uses a relational data model at its core. Being built on PostgreSQL, it offers many ways to store data, including partitioned tables or simply separate ones. A common pattern in the relational world is to divide data into separate tables based on their content, also often referred to as a <a href="https://opentextbc.ca/dbdesign01/chapter/chapter-7-the-relational-data-model/"><em>domain</em></a> or <em>entity</em>. That means that data belonging to a set of A’s is stored in a different table than a set of B’s, representing different <em>domains</em>.</p><p>That leaves us with the question of how the concepts of time-series data and relational domains fit together. Unfortunately, there is no easy answer. Our primary options are “squeezing” all data into a single table, which could have hundreds of columns (basically defining our domain around the idea of “it’s all just metrics”), or splitting data into multiple tables with fewer columns. The latter choice may slice tables in many ways, such as by metric type (temperature is different from humidity, stock symbol A is different from symbol B), customer, data type, or combinations thereof.</p><p>Both possibilities have their own set of advantages and disadvantages, which can be split into four commonly seen topics:</p>
<ol>
<li>Ease of use</li>
<li>Multi-tenancy / Privacy-related requirements (General Data Protection Regulation or GDPR / California Consumer Privacy Act or CCPA / others)</li>
<li>Schema migration or upgrading / Future-proofing</li>
<li>Tooling support</li>
</ol>
<p>I did not choose the above order by accident: the earlier a question appears, the more strongly its answer may constrain your options further down the line.</p><h1 id="pros-and-cons">Pros and Cons</h1>
<p>As said before, both design choices have pros and cons, and it’s vital to understand them before making a final data modeling decision. Given the previous set of topics, let’s start by answering a few questions.</p><p>First and foremost, how important is ease of use? That is, are you and your team up to the challenging tasks that need to be solved down the road? Potential “complications” could involve generation patterns for table names or ensuring that similar tables are all upgraded to the same schema level.</p><p>Next up, do you need to provide a stricter level of multi-tenancy, storing different customers not just segregated by a customer ID but in different tables, schemas, or even databases? Is your company bound by regulations (e.g., GDPR or CCPA) where users and customers may have the right to be forgotten? With time-series data being (normally) append-only, removing parts of the data (this specific user’s data) may be tricky.</p><p>Then we have the question of whether you expect the data schema to change frequently. An extensive discussion of a future-proof design for hypertables can be found in the <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/"><em>Narrow, Medium, or Wide Table Model</em></a> best practices write-up. However, the higher the number of tables, the more of them need to be upgraded or migrated in the future, adding further complexity.</p><p>Finally, how important is support from additional tools and frameworks, such as ORM (Object Relational Mapping) solutions? 
While I personally don’t think ORM frameworks are a great fit for time-series data (especially when using aggregations), a lot of folks out there make extensive use of them, so talking about them has its merits.</p><p>Now that we’ve answered those questions, let’s dig into the design choices in greater detail.</p><h1 id="single-table-designs">Single Table Designs</h1>
<p>Storing all data in a single table may initially feel irresponsible from a relational database point of view. Depending on how I slice my domain model, though, it could be a perfectly valid option. The design choice makes sense if I consider all stored data (metrics, events, IoT data, stock prices, etc.) as a single domain, namely time series.</p><p>Single tables make a few things super simple. First of all, and probably obvious to most, is querying. Everything is in the same table, and queries select certain columns and add filters via <code>WHERE</code> clauses to narrow down the data. That is as basic as it gets with SQL.</p><p>Upgrading the table’s schema is equally simple. Adding or removing columns requires a single command, all data is upgraded at the same time, and no real migration window is needed. With multiple similar tables, by contrast, you may end up in a situation where some tables are upgraded while others are simply forgotten.</p><p>Single tables can easily support multiple different values, either through multiple columns (wide table layout), a JSONB column that supports a wide range of data values (narrow table layout), or through columns based on a value’s potential data type (medium table layout). <a href="https://timescale.ghost.io/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/">Those three options have pros and cons, though</a>. </p><pre><code>tsdb=&gt; \d
      Column    |           Type           | Collation | Nullable |      Default
----------------+--------------------------+-----------+----------+-------------------
 created        | timestamp with time zone |           | not null | now()
 point_id       | uuid                     |           | not null | gen_random_uuid()
 device_id      | uuid                     |           | not null |
 temp           | double precision         |           |          |
 hum            | double precision         |           |          |
 co2            | integer                  |           |          |
 wind_speed     | integer                  |           |          |
 wind_direction | integer                  |           |          |

|                  created | point_id | device_id | temp |  hum | co2 | wind_speed | wind_direction |
| 2022-01-01 00:00:00.0+00 |      123 |        10 | 24.7 | 57.1 | 271 |       NULL |           NULL |
</code></pre>
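<p>For illustration, a single wide hypertable like the one above could be created along the following lines. This is a minimal sketch: the table name <code>metrics</code> is an assumption, and the columns simply mirror the sample output.</p><pre><code>-- Sketch: one wide hypertable holding all metrics (table name is illustrative).
CREATE TABLE metrics (
    created        TIMESTAMPTZ      NOT NULL DEFAULT now(),
    point_id       UUID             NOT NULL DEFAULT gen_random_uuid(),
    device_id      UUID             NOT NULL,
    temp           DOUBLE PRECISION,
    hum            DOUBLE PRECISION,
    co2            INTEGER,
    wind_speed     INTEGER,
    wind_direction INTEGER
);

-- Turn the plain table into a hypertable, partitioned by time.
SELECT create_hypertable('metrics', 'created');
</code></pre><p>From here on, the table is queried and written to like any regular PostgreSQL table.</p>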
<p>Last but not least, single tables play very nicely with tools like ORM frameworks. It is easy to connect a specific set of ORM entities to the hypertable, representing either a full record or the result of an aggregation (which may need a native query instead of an automatically generated one).</p><p>But as with everything in this world, this choice has a large downside. Since time-series data is designed around the idea of being append-only (meaning that mutating existing records only happens occasionally), it is hard to delete data. Deleting data based on specific requirements, such as a user or customer asking to have all their data deleted, is even harder.</p><p>In that situation, we’d have to crawl our way through potentially years of data, removing records from all over the place. That’s not only a burden on the WAL (Write-Ahead Log) to keep track of the changes, but it also creates loads of read and write I/O.</p><p>The same is true if we try to store collected and calculated sets of data in the same table. With many systems often having to backfill data (for example, from devices that lost their internet connection for a while and were collecting data locally), calculated values may have to be recalculated. That means that the already stored data must be invalidated (which may mean deleted) and reinserted.</p><p>Finally, if your company provides different tiers of data retention, good luck implementing this on a single table. It combines the previous two issues, but on a constant basis.</p><h1 id="multiple-table-designs">Multiple Table Designs</h1>
<p>Now that we know about the pros and cons of the single table design, what changes when we aim for a multiple table design instead?</p><p>While querying is still simple, querying multiple sets of data simultaneously may be slightly more complicated, involving <code>JOINs</code> and <code>UNIONs</code> to merge data from the different tables. Requesting multiple sets of data at the same time is often done for efficiency reasons, requiring fewer database round trips and minimizing response time. Apart from that, there isn’t a massive difference in ease of use, except for table names, but we’ll come back to that in a second.</p><p>One of the major benefits of having multiple tables, especially when sliced by the customer, user, or whatever multi-tenancy divider is meaningful for your use case, is the option to quickly react to GDPR- or CCPA-related requests to destroy and remove any customer-related data. In this case, it is as easy as finding all the client’s tables and dropping them. Removing them from backups is a different story, though. 😌</p><p>The same is true with calculated and collected data. Separating those tables makes it much easier to throw away and recalculate all or parts of the data when late information arrives.</p><p>Also similar is the previously mentioned data retention. Many companies storing huge amounts of data on behalf of their customers provide different data retention policies based on how much the customer is willing to pay. Having tables sliced by customers makes it easy to set customer-specific retention policies and even change them when a customer upgrades to a higher tier. If this is something you need, the <em>multi-table design</em> is it.</p><p>However, just as with single tables, multiple tables have drawbacks too.</p><p>Besides the already slightly more complicated elements around querying, which are not necessarily a disadvantage, having many tables requires planning a table naming schema. The more dimensions we bring into the game (by customer, metric type, etc.), the more complicated our naming schema needs to be. As a result, we may end up with table names such as <code>&lt;&lt;customer_name&gt;&gt;__&lt;&lt;metric_type&gt;&gt;</code>. While this doesn’t sound too bad, it can get ugly fast. We’ve all been there before. 😅</p>
<pre><code>tsdb=&gt; \dt *.cust_*
                    List of relations
 Schema |            Name            | Type  |   Owner
--------+----------------------------+-------+-----------
 public | cust_mycompany_co2         | table | tsdbadmin
 public | cust_mycompany_humidity    | table | tsdbadmin
 public | cust_mycompany_temperature | table | tsdbadmin
 public | cust_timescale_co2         | table | tsdbadmin
 public | cust_timescale_humidity    | table | tsdbadmin
 public | cust_timescale_temperature | table | tsdbadmin
(6 rows)
</code></pre>
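<p>To make these trade-offs concrete, here are a few hedged sketches against the tables listed above. The column names <code>created</code> and <code>value</code>, as well as the retention interval, are assumptions for illustration.</p><pre><code>-- Reading two metrics of one customer at once requires a UNION ALL
-- (columns "created" and "value" are assumed):
SELECT created, 'temperature' AS metric, value
  FROM cust_mycompany_temperature
UNION ALL
SELECT created, 'humidity' AS metric, value
  FROM cust_mycompany_humidity
ORDER BY created DESC
LIMIT 100;

-- Customer-specific data retention becomes a per-table policy
-- (TimescaleDB's add_retention_policy; the interval is illustrative):
SELECT add_retention_policy('cust_mycompany_temperature', INTERVAL '90 days');

-- A GDPR/CCPA deletion request turns into a cheap DROP instead of a
-- long-running DELETE across years of data:
DROP TABLE cust_mycompany_co2,
           cust_mycompany_humidity,
           cust_mycompany_temperature;
</code></pre>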
<p>Tools such as ORM frameworks may make things even more complicated. Many of those tools are not designed to support arbitrary, runtime-generated table names, making it very complicated to integrate them with this design. Using different PostgreSQL database schemas per customer and lowering the number of dimensions may help.</p><p>There is one more complexity: upgrading and migrating tables. Due to the multiple table design, we may end up with many similar tables segregated by the additional dimensions chosen. When upgrading the table schema, we need to ensure that all those tables eventually end up in a consistent state. </p><p>However, many automatic schema migration tools do not easily support that kind of multi-table migration. This forces us to fall back on writing migration scripts, trying to find all tables matching a specific naming schema, and ensuring that they are all upgraded the same way. If we miss a table, we’ll figure it out eventually, but probably when it’s too late.</p><h1 id="the-tldr">The TL;DR</h1>
<p>Now that you’ve laid it all out and answered these questions, you can look at the requirements and see where your use case fits.</p><p>Some hard requirements may make your choice obvious, while some “nice-to-have” elements may still influence the final decision and hint at what may become a harder requirement in the near or far future.</p><table>
<thead>
<tr>
<th></th>
<th style="text-align:right"><strong>Single Table Design</strong></th>
<th style="text-align:right"><strong>Multiple Table Design</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Ease of Use</strong></td>
<td style="text-align:right">Easy</td>
<td style="text-align:right">Somewhat easy</td>
</tr>
<tr>
<td><strong>Multi-Tenancy / <br>Privacy Regulations</strong></td>
<td style="text-align:right">Hard</td>
<td style="text-align:right">Easy</td>
</tr>
<tr>
<td><strong>Future-Proofing</strong></td>
<td style="text-align:right">Easy</td>
<td style="text-align:right">Somewhat hard</td>
</tr>
<tr>
<td><strong>Tooling Support</strong></td>
<td style="text-align:right">Easy</td>
<td style="text-align:right">Hard</td>
</tr>
</tbody>
</table>
<p>While the <em>single table design</em> is very easy to get started with, it may be a hard sell if you need to comply with regulations. However, the <em>multiple table design</em> is certainly more complicated to manage and use correctly.</p><h1 id="so-whats-the-best-one-for-me">So What's the Best One for Me?</h1>
<p>There’s never a one-size-fits-all answer, or as consultants love to put it: it depends.</p><p>Unlike the design choices around the table layout, this decision is too complex for a blanket recommendation. You can only try to follow the suggested process of answering the questions above and looking at the answers to see which points represent hard requirements, which could become hard requirements, and which are simply nice to have. That way, you’ll likely find the answer that matches your use case.</p><p>Also, remember that you may have different use cases with diverging requirements, so one use case may end up being perfectly fine running as a single table design, while the other one(s) may need multiple tables.</p><p>Plus, you have the chance to mix and match the benefits of both solutions. As hinted at earlier, it is possible to use a simplified multiple table design (for example, per metric type) and separate the customer dimension into a PostgreSQL database schema, with each schema representing one customer.</p><p>Similarly, it is possible to use the schema to separate customers and store all metrics/events/data for that specific customer in a single hypertable. There are plenty of options, only limited by your imagination.</p><p>Whatever you end up with, try to be as future-proof as possible. 
Try to imagine what the future will hold, and if you’re unsure whether a nice-to-have will become a must-have, it may be worth treating it as a hard requirement now, just to be on the safe side.</p><p>If you want to start designing your hypertable database schema as soon as possible, ensuring you get the best performance and user experience for your time-series data while achieving incredible compression ratios, check out <a href="https://console.cloud.timescale.com/signup">Timescale</a>.</p><p>If you’re looking to test it locally or run it on-prem, then we’ve got you covered: have a <a href="https://docs.timescale.com/">look at our documentation</a>.</p><h3 id="learn-more">Learn more</h3><ul><li><a href="https://timescale.ghost.io/blog/best-practices-for-time-series-metadata-tables/"><strong>Best Practices for (Time-)Series Metadata Tables</strong></a></li><li><a href="https://www.timescale.com/learn/designing-your-database-schema-wide-vs-narrow-postgresql-tables" rel="noreferrer"><strong>Choosing Your Database Schema: Wide vs. Narrow PostgreSQL Tables</strong></a></li></ul>]]></content:encoded>
        </item>
    </channel>
</rss>