<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 09:58:23 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Counter Analytics in PostgreSQL: Beyond Simple Data Denormalization]]></title>
            <description><![CDATA[Record counting on demand or denormalized counters? We break down the two and show you an alternative using PostgreSQL.]]></description>
            <link>https://www.tigerdata.com/blog/counter-analytics-in-postgresql-beyond-simple-data-denormalization</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/counter-analytics-in-postgresql-beyond-simple-data-denormalization</guid>
            <category><![CDATA[PostgreSQL Tips]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 04 Dec 2024 21:42:37 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization.png">
            </media:content>
            <content:encoded><![CDATA[<p>If you've been working with PostgreSQL, you've probably seen memes advocating for denormalized counters instead of counting related records on demand. The debate usually looks like this:</p><pre><code class="language-SQL">-- The "don't do this" approach: counting related records on demand
SELECT COUNT(*) FROM post_likes WHERE post_id = $1;
-- The "do this instead" approach: maintaining a denormalized counter
SELECT likes_count FROM posts WHERE post_id = $1;</code></pre><p>Let's break down these approaches. In the first approach, we calculate the like count by scanning the <code>post_likes</code> table each time we need the number. In the second approach, we maintain a pre-calculated counter in the <code>posts</code> table which we update whenever someone likes or unlikes a post.</p><p>The denormalized counter approach is often recommended for OLTP (online transaction processing) workloads because it trades write overhead for read performance. Instead of executing a potentially expensive <code>COUNT</code> query that needs to scan the entire <code>post_likes</code> table, we can quickly fetch a pre-calculated number.&nbsp;</p><p>This is particularly valuable in social media applications, where like counts are frequently displayed but rarely updated—you're showing like counts on posts much more frequently than users are actually liking posts.</p><p>However, when we enter the world of time-series data and high-frequency updates, this conventional wisdom needs a second look. 
Let me share an example that made me reconsider this approach while working with a PostgreSQL database optimized for time series via the <a href="https://github.com/timescale/timescaledb"><u>TimescaleDB extension</u></a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization_Justin-Bieber.png" class="kg-image" alt="" loading="lazy" width="789" height="661" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization_Justin-Bieber.png 600w, https://timescale.ghost.io/blog/content/images/2024/12/Counter-Analytics-in-PostgreSQL-Beyond-Simple-Data-Denormalization_Justin-Bieber.png 789w" sizes="(min-width: 720px) 720px"><figcaption><a href="https://www.linkedin.com/posts/moez-zhioua_ever-heard-of-the-justin-bieber-problem-in-activity-7266724225112047617-CLLe/"><u><span class="underline" style="white-space: pre-wrap;">Source</span></u></a></figcaption></figure><p>While this advice might make sense for traditional OLTP workloads, when working with time-series data in TimescaleDB, we need to take a different approach to data modeling.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">To learn more about <a href="https://www.timescale.com/learn/data-modeling-on-postgresql"><u>data modeling in PostgreSQL, check out our guide</u></a>.</div></div><h2 id="counter-analytics-vs-data-denormalization-and-its-limitations">Counter Analytics vs. Data Denormalization and Its Limitations</h2><p>Let's start with a common scenario: tracking post likes in a social media application. The traditional data denormalization approach might look like this:</p><pre><code class="language-SQL">-- Traditional table structure
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    content TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    likes_count INTEGER DEFAULT 0
);

CREATE TABLE post_likes (
    post_id INTEGER REFERENCES posts(id),
    user_id INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (post_id, user_id)
);
</code></pre><p>With this structure, every like operation requires two updates:</p><pre><code class="language-SQL">-- When a user likes a post
BEGIN;
INSERT INTO post_likes (post_id, user_id) VALUES (1, 123);
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;
COMMIT;
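-- A sketch of the reverse path (assumed symmetric; not shown above): when a
-- user unlikes a post, the counter must be decremented in the same transaction.
BEGIN;
DELETE FROM post_likes WHERE post_id = 1 AND user_id = 123;
UPDATE posts SET likes_count = likes_count - 1 WHERE id = 1;
COMMIT;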
</code></pre><h3 id="the-hidden-costs-of-data-denormalization">The hidden costs of data denormalization</h3><p>While this might seem efficient at first glance, it introduces several problems:</p><p>1.&nbsp;&nbsp;<strong>VACUUM overhead</strong>: Every update to <code>likes_count</code> creates a new version of the row in the posts table. PostgreSQL's MVCC (<a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/#:~:text=Since%20PostgreSQL%20runs%20under%20the%20MVCC%20system"><u>multiversion concurrency control</u></a>) means old versions aren't immediately removed, leading to the following:</p><pre><code class="language-SQL">-- Check bloat in posts table
SELECT schemaname, relname, n_dead_tup, n_live_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'posts';</code></pre><p>2.&nbsp;<strong>Transaction contention</strong>: Multiple concurrent likes on the same post create lock contention on the <code>posts</code> row.</p><h2 id="the-timescaledb-way-counter-analytics-for-time-series">The TimescaleDB Way: Counter Analytics for Time Series</h2><p>This is one of those cases where TimescaleDB can give PostgreSQL a helping hand. Instead of maintaining a running counter, let's leverage TimescaleDB's strengths. We’ll start by using a <a href="https://docs.timescale.com/use-timescale/latest/hypertables/"><u>hypertable</u></a> to partition the data automatically by the time column.</p><pre><code class="language-SQL">-- Create a hypertable for post_likes
CREATE TABLE post_likes (
    post_id INTEGER,
    user_id INTEGER,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (post_id, user_id, created_at)
);
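-- Note: created_at joins the primary key because any unique constraint on a
-- hypertable must include the partitioning column.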

SELECT create_hypertable('post_likes', by_range('created_at', INTERVAL '1 month'));</code></pre><p>The <a href="https://github.com/timescale/timescaledb"><u>TimescaleDB extension</u></a> will automatically create new child tables and split the data into several partitions, in this case, one per month. Here are the key performance advantages of adopting hypertables:</p><ul><li>Parallel computation for queries: all counts and statistics can be parallelized across the partitions.</li><li>Data lifecycle: tables partitioned by time allow you to easily compress data after X days or drop the entire partition after X months.</li><li><a href="https://timescale.ghost.io/blog/building-columnar-compression-in-a-row-oriented-database/"><u>Columnar compression</u></a> can be enabled and will work as an index to segment the data.</li></ul><p>It’s important to remember that the hypertable architecture is a <a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning"><u>paradigm shift in database partitioning</u></a>. Each partition stores its own table statistics and indexes, so dropping an entire partition is fast and creates no extra work for VACUUM or updates.</p><h3 id="continuous-aggregates-for-efficient-counting">Continuous aggregates for efficient counting</h3><p>Parallelizing queries will not, by itself, avoid rescanning the full dataset whenever we need statistics. To increase efficiency, we can group data hourly and process it hour by hour. Vanilla PostgreSQL does not allow partial refreshes on materialized views, which is why Timescale developed the continuous aggregates feature.</p><p>The <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/"><u>continuous aggregate</u></a> will maintain pre-computed counts. 
Instead of computing counts during query time or updating every new like, we can create a materialized view with superpowers.</p><pre><code class="language-SQL">-- Create a view for hourly like counts
CREATE MATERIALIZED VIEW post_likes_hourly
WITH (timescaledb.continuous) AS
SELECT 
    post_id,
    time_bucket('1 hour', created_at) AS bucket,
    count(*) as likes_count
FROM post_likes
GROUP BY post_id, time_bucket('1 hour', created_at);

-- Set refresh policy
SELECT add_continuous_aggregate_policy('post_likes_hourly',
    start_offset =&gt; INTERVAL '3 hours',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');</code></pre><p>The refresh policy makes it run on a schedule and only refreshes the part that has not been computed yet. Through a “watermark” mechanism, the refresh time is stored, and the data is updated from the latest watermark point. You can read more about it in our <a href="https://timescale.ghost.io/blog/real-time-analytics-for-time-series-continuous-aggregates/"><u>dev’s intro to continuous aggregates</u></a>.</p><p>You may be thinking, “What? But what if I change the raw data?” TimescaleDB can also track it and refresh only the updated parts.</p><p>If you like this idea, you'll probably also love the ability to use continuous aggregates <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/hierarchical-continuous-aggregates/"><u>hierarchically</u></a>.</p><h2 id="benefits-of-the-timescaledb-approach">Benefits of the TimescaleDB Approach</h2><ol><li><strong>Efficient storage</strong>: TimescaleDB's chunking mechanism automatically partitions data by time, making fewer VACUUM operations necessary.</li><li><strong>Better concurrency</strong>: no need to update a single counter row, <a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/"><u>eliminating lock contention</u></a>.&nbsp;</li><li><strong>Rich analytics</strong>: we can easily answer complex questions.</li></ol><pre><code class="language-SQL">-- Get likes trend over time
SELECT 
    post_id,
    bucket,
    likes_count,
    sum(likes_count) OVER (PARTITION BY post_id ORDER BY bucket) as cumulative_likes
FROM post_likes_hourly
WHERE post_id = 1
ORDER BY bucket DESC;</code></pre><h3 id="performance-comparison-counter-analytics-vs-data-denormalization">Performance comparison: Counter analytics vs. data denormalization</h3><p>Let's benchmark both approaches:</p><pre><code class="language-SQL">-- Traditional approach
EXPLAIN ANALYZE
UPDATE posts SET likes_count = likes_count + 1 WHERE id = 1;

-- TimescaleDB approach
EXPLAIN ANALYZE
INSERT INTO post_likes (post_id, user_id, created_at) 
VALUES (1, 123, NOW());</code></pre><p>The TimescaleDB approach shows better performance characteristics under high concurrency and provides more analytical capabilities.</p><h2 id="best-practices-for-real-time-counts">Best Practices for Real-Time Counts</h2><p>For applications requiring <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/real-time-aggregates/"><u>real-time counts</u></a>, we can set the materialized view parameter <code>timescaledb.materialized_only=false</code> so that queries combine the materialized data with the not-yet-materialized raw data at query time.</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW post_likes_hourly
WITH (timescaledb.continuous, timescaledb.materialized_only=false) AS
SELECT 
    post_id,
    time_bucket('1 hour', created_at) AS bucket,
    count(*) as likes_count
FROM post_likes
GROUP BY post_id, time_bucket('1 hour', created_at);</code></pre><p>Behind the scenes, TimescaleDB will create a hypertable for the materialized view and refresh the view according to the refresh policy. When the refresh starts, it saves a watermark to track the latest refreshed bucket.</p><p>When you query <code>post_likes_hourly</code>, it combines the materialized data with the latest buckets from the hypertable, filtering only on buckets greater than the watermark. This means that instead of scanning the raw dataset, it will only process the part that has not been materialized yet.</p><h2 id="establishing-a-retention-policy">Establishing a Retention Policy</h2><p>Now that we have a continuous aggregate, we need to establish a <a href="https://docs.timescale.com/use-timescale/latest/data-retention/create-a-retention-policy/"><u>retention policy</u></a> to prevent the hypertable from growing indefinitely. As we're storing the data in chunks, we can set a retention policy to delete the chunks that are older than a certain period.</p><pre><code class="language-SQL">SELECT add_retention_policy('post_likes_hourly', INTERVAL '1 month');</code></pre><p>This command runs a background job that deletes chunks older than one month. The past data is deleted in the background, and the continuous aggregate remains up to date.</p><p>Also, data is removed only when an entire partition is dropped. Every partition has its own metadata, so no statistics updates are needed and no extra work falls on the VACUUM process.</p><h2 id="conclusion">Conclusion</h2><p>While denormalized counters might seem appealing for simple OLTP workloads, TimescaleDB's time-series capabilities offer a more scalable and maintainable solution. 
By leveraging continuous aggregates and proper time-series modeling, we can achieve better performance, richer analytics, and more reliable data management.</p><p>Remember:</p><ul><li>Use hypertables for time-series data</li><li>Leverage continuous aggregates for efficient computations</li><li>Consider the full lifecycle of your data, including retention policies</li><li>Think in terms of time-series patterns rather than traditional OLTP patterns</li></ul><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/12/Screenshot-2024-12-04-at-18.51.52.png" class="kg-image" alt="" loading="lazy" width="1078" height="1452" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/12/Screenshot-2024-12-04-at-18.51.52.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/12/Screenshot-2024-12-04-at-18.51.52.png 1000w, https://timescale.ghost.io/blog/content/images/2024/12/Screenshot-2024-12-04-at-18.51.52.png 1078w" sizes="(min-width: 720px) 720px"></figure><p>This approach might require a mindset shift, but the benefits in terms of scalability and maintenance make it worthwhile for time-series workloads. To give TimescaleDB a try, <a href="https://docs.timescale.com/self-hosted/latest/install/"><u>install it on your machine</u></a>. If you prefer a mature, managed PostgreSQL platform that delivers even more scalability, you can <a href="https://console.cloud.timescale.com/signup"><u>try Timescale Cloud for free</u></a>.</p><h3 id="learn-more">Learn more</h3><ul><li><a href="https://timescale.ghost.io/blog/how-to-reduce-your-postgresql-database-size/"><u>How to Reduce Your PostgreSQL Database Size</u></a></li><li><a href="https://www.timescale.com/learn/pg_partman-vs-hypertables-for-postgres-partitioning/"><u>Pg_partman vs. 
Hypertables for Postgres Partitioning | Timescale</u></a></li><li><a href="https://timescale.ghost.io/blog/how-timescaledb-solves-common-postgresql-problems-in-database-operations-with-data-retention-management/"><u>How TimescaleDB Solves Common PostgreSQL Problems in Database Operations With Data Retention Management</u></a></li></ul>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Better Ruby ORM for Time Series and Analytics]]></title>
            <description><![CDATA[Seamlessly create rollups from rolled-up data (hierarchical continuous aggregates) on your Ruby On Rails application for faster time-series & analytics queries.]]></description>
            <link>https://www.tigerdata.com/blog/building-a-better-ruby-orm-for-time-series-and-analytics</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/building-a-better-ruby-orm-for-time-series-and-analytics</guid>
            <category><![CDATA[Ruby]]></category>
            <category><![CDATA[PostgreSQL, Blog]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 27 Nov 2024 13:30:11 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/11/Building-a-Better-Ruby-ORM-for-Time-Series-and-Analytics_final.png">
            </media:content>
            <content:encoded><![CDATA[<p>Rails developers know the joy of working with ActiveRecord. <a href="https://dhh.dk"><u>DHH</u></a> didn’t just give us a framework; he gave us a philosophy, an intuitive way to manage data that feels delightful. But when it comes to time-series data, think metrics, logs, or events, ActiveRecord can start to feel a little stretched. Handling huge volumes of time-stamped data efficiently for analytics? That’s a challenge it wasn’t designed to solve (and neither was PostgreSQL).</p><p>This is where <a href="https://github.com/timescale/timescaledb" rel="noreferrer">TimescaleDB</a> comes in. Built on PostgreSQL (it’s an extension), TimescaleDB is purpose-built for time series and other demanding workloads, and thanks to the <a href="https://rubygems.org/gems/timescaledb" rel="noreferrer">timescaledb gem</a>, it integrates seamlessly into Rails. You don’t have to leave behind the conventions or patterns you love; it just works alongside them.</p><p>One of TimescaleDB’s standout features is <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/about-continuous-aggregates/" rel="noreferrer"><strong>continuous aggregates</strong></a>. Think of them as an upgrade to materialized views, automatically refreshing in the background so your data is always up-to-date and fast to query. With the new timescaledb gem’s <strong>continuous aggregates macro</strong>, you can define hierarchical time-based summaries in a single line of Ruby. 
It even reuses your existing ActiveRecord scopes, so you’re not duplicating logic you’ve already written.</p><p>Now, your Rails app can effortlessly handle real-time analytics dashboards or historical reports, scaling your time-series workloads while staying true to the Rails philosophy.</p><h2 id="better-time-series-data-aggregations-using-ruby-the-inspiration">Better Time-Series Data Aggregations Using Ruby: The Inspiration</h2><p>The following code snippet highlights the real-life use case that inspired me to build a continuous aggregates macro for better time-series data aggregations. It’s part of a <a href="https://github.com/rubygems/rubygems.org/pull/4979"><u>RubyGems contribution I made</u></a>, and it’s still a work in progress. However, it’s worth validating how this idea can reduce the Ruby code you’ll have to maintain.</p><h3 id="example-model">Example model</h3><pre><code class="language-Ruby">class Download &lt; ActiveRecord::Base
  extend Timescaledb::ActsAsHypertable
  include Timescaledb::ContinuousAggregatesHelper

  acts_as_hypertable time_column: 'ts'

  scope :total_downloads, -&gt; { select("count(*) as total") }
  scope :downloads_by_gem, -&gt; { select("gem_name, count(*) as total").group(:gem_name) }
  scope :downloads_by_version, -&gt; { select("gem_name, gem_version, count(*) as total").group(:gem_name, :gem_version) }

  continuous_aggregates(
    timeframes: [:minute, :hour, :day, :month],
    scopes: [:total_downloads, :downloads_by_gem, :downloads_by_version],
    refresh_policy: {
      minute: { start_offset: "10 minutes", end_offset: "1 minute", schedule_interval: "1 minute" },
      hour:   { start_offset: "4 hour",     end_offset: "1 hour",   schedule_interval: "1 hour" },
      day:    { start_offset: "3 day",      end_offset: "1 day",    schedule_interval: "1 day" },
      month:  { start_offset: "3 month",    end_offset: "1 day",  schedule_interval: "1 day" }
  })
end
</code></pre><p>The <a href="https://docs.timescale.com/use-timescale/latest/continuous-aggregates/refresh-policies/"><u><code>refresh_policy</code></u></a> will work for all the timeframes, but it is not mandatory and can be skipped. Keep in mind that declaring the macro in the model has almost no effect on its own: the continuous aggregates are actually created in a database migration, through migration helpers that read this metadata. Let’s take a look at the helpers we have.</p><h3 id="the-migration-helpers">The migration helpers</h3><p>The macro defines the continuous aggregates in the model, and for the migration it can generate the SQL for all the views, iterating over each timeframe and scope you specify.</p><p>The <code>create_continuous_aggregates</code> and <code>drop_continuous_aggregates</code> methods are designed to be invoked during the database migration step.</p><p>So, after saving your model with the new <code>continuous_aggregates</code> definition, you can use the <code>create_continuous_aggregates</code> method to invoke the creation of all materialized views in the database. If you use <code>refresh_policy</code>, it will also add all the policies along with the aggregation. Here’s what a migration file would look like:</p><pre><code class="language-Ruby">class SetupMyAmazingCaggsMigration &lt; ActiveRecord::Migration[7.0]
  def up
    Download.create_continuous_aggregates
  end

  def down
    Download.drop_continuous_aggregates
  end
end
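
# Run via the standard Rails workflow, e.g.:
#   bin/rails db:migrate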
</code></pre><p>It will automatically create all the continuous aggregates for all timeframes and scopes in the right dependency order. When <code>create_continuous_aggregates</code> is called, 12 continuous aggregates (3 scopes x 4 timeframes) will be created, from minute up to month.</p><h3 id="the-migration-output">The migration output</h3><p>Let’s take a deep look at the SQL generated behind the scenes when <code>create_continuous_aggregates</code> is called. For the first scope, it builds the minute-level continuous aggregate directly from the raw data.</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) as ts, count(*) as total
FROM "downloads"
GROUP BY 1
WITH NO DATA;
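
-- Created WITH NO DATA: a refresh policy (added next) or a manual refresh
-- populates it. A one-off manual refresh could look like this (hypothetical
-- open-ended window):
-- CALL refresh_continuous_aggregate('total_downloads_per_minute', NULL, NULL);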
</code></pre><p>Every materialization occurs independently, and for it to happen automatically, a refresh policy needs to be added. Because the policy was specified per timeframe in the macro, the minute-level view gets the minute refresh policy.</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('total_downloads_per_minute',
  start_offset =&gt; INTERVAL '10 minutes',
  end_offset =&gt;  INTERVAL '1 minute',
  schedule_interval =&gt; INTERVAL '1 minute');
</code></pre><p>Now, continuing the creation, it goes for the hourly level, already reusing the data from the previous materialized view.</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) as ts, sum(total) as total FROM "total_downloads_per_minute" 
GROUP BY 1
WITH NO DATA;
</code></pre><p>An hourly policy is also established to guarantee that the view refreshes automatically. The same iteration is then repeated for the daily and monthly timeframes.</p><pre><code class="language-SQL">SELECT add_continuous_aggregate_policy('total_downloads_per_hour',
  start_offset =&gt; INTERVAL '4 hour',
  end_offset =&gt;  INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');

CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) as ts, sum(total) as total FROM "total_downloads_per_hour" GROUP BY 1
WITH NO DATA;

SELECT add_continuous_aggregate_policy('total_downloads_per_day',
  start_offset =&gt; INTERVAL '3 day',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS total_downloads_per_month
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', ts) as ts, sum(total) as total FROM "total_downloads_per_day" GROUP BY 1
WITH NO DATA;

SELECT add_continuous_aggregate_policy('total_downloads_per_month',
  start_offset =&gt; INTERVAL '3 month',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) as ts, gem_name, count(*) as total FROM "downloads" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_minute',
  start_offset =&gt; INTERVAL '10 minutes',
  end_offset =&gt;  INTERVAL '1 minute',
  schedule_interval =&gt; INTERVAL '1 minute');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) as ts, gem_name, sum(total) as total FROM "downloads_by_gem_per_minute" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_hour',
  start_offset =&gt; INTERVAL '4 hour',
  end_offset =&gt;  INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) as ts, gem_name, sum(total) as total FROM "downloads_by_gem_per_hour" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_day',
  start_offset =&gt; INTERVAL '3 day',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_gem_per_month
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', ts) as ts, gem_name, sum(total) as total FROM "downloads_by_gem_per_day" GROUP BY 1, gem_name
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_gem_per_month',
  start_offset =&gt; INTERVAL '3 month',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_minute
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', ts) as ts, gem_name, gem_version, count(*) as total FROM "downloads" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_minute',
  start_offset =&gt; INTERVAL '10 minutes',
  end_offset =&gt;  INTERVAL '1 minute',
  schedule_interval =&gt; INTERVAL '1 minute');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) as ts, gem_name, gem_version, sum(total) as total FROM "downloads_by_version_per_minute" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_hour',
  start_offset =&gt; INTERVAL '4 hour',
  end_offset =&gt;  INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_day
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) as ts, gem_name, gem_version, sum(total) as total FROM "downloads_by_version_per_hour" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_day',
  start_offset =&gt; INTERVAL '3 day',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');

CREATE MATERIALIZED VIEW IF NOT EXISTS downloads_by_version_per_month
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 month', ts) as ts, gem_name, gem_version, sum(total) as total FROM "downloads_by_version_per_day" GROUP BY 1, gem_name, gem_version
WITH NO DATA;

SELECT add_continuous_aggregate_policy('downloads_by_version_per_month',
  start_offset =&gt; INTERVAL '3 month',
  end_offset =&gt;  INTERVAL '1 day',
  schedule_interval =&gt; INTERVAL '1 day');
</code></pre><p>That’s massive, right?! It’s almost too boring to read because the structure is so repetitive, iterating over all the scopes. The <code>continuous_aggregates</code> macro leverages all this logic by iterating over every timeframe for every scope. It reuses minute data in the hourly view and applies the same technique from hour to day, day to month, and so on.</p><p>Writing all of these aggregations by hand, in contrast, would make reusing them really error-prone. The <code>Model.drop_continuous_aggregates</code> method uses the reverse dependency path to call <code>DROP MATERIALIZED VIEW</code> from month down to minute.</p><p>Continuously aggregating statistics can replace dozens of background jobs hosted by your application, avoiding serialization and deserialization effort, as well as bandwidth, I/O (input/output), and resource overuse in general.</p><p>Reusing the previous timeframes makes the processing very fast and lightweight for the database. Adopting hierarchical processing also keeps processing speed predictable, because the number of rows stays static and depends only on the cardinality of the data.</p><p>Processing aggregations in the database means the work happens only between the database and the disk, freeing the application from the network round trips that processing the data in application background jobs would require.</p><p>Now, let’s take a look at how the rollup works.</p><h2 id="hyperfunctions-integration-for-faster-time-series-analysis">Hyperfunctions Integration for Faster Time-Series Analysis</h2><p>Timescale also built a specialized extension for time-series data processing, the <a href="https://docs.timescale.com/self-hosted/latest/tooling/install-toolkit/"><u>timescaledb-toolkit</u></a>. 
It helps improve the developer experience and query performance, and most of its functions are called hyperfunctions.</p><p><a href="https://docs.timescale.com/api/latest/hyperfunctions/"><u>Hyperfunctions</u></a> are designed to reuse and make statistics fast for hypertables, allowing you to roll up granular aggregations into bigger timeframes. In the case of the Ruby library, it should work well with both regular statistics functions and also roll up the hyperfunctions already available.</p><p>The most important part of using multiple timeframes and scopes is to understand how the <code>rollup</code> scope works.&nbsp;</p><p>For example, if you have a scope called <code>total_downloads</code> and a timeframe of <code>day</code>, the rollup will rewrite the query to group by the day.</p><pre><code class="language-SQL"># Original query
SELECT count(*) FROM downloads;

-- Rolled up query
SELECT time_bucket('1 day', created_at) AS day, count(*) FROM downloads GROUP BY day;
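
-- Rolling up a rollup (illustrative): bigger timeframes can reuse
-- already-materialized counters instead of rescanning the raw rows,
-- e.g., building hourly totals from a minute-level materialized view:
SELECT time_bucket('1 hour', ts) AS ts, sum(total) AS total
FROM total_downloads_per_minute GROUP BY 1;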
</code></pre><p>In Ruby, the <code>rollup</code> method performs this rewrite for you. Let’s consider the <code>total_downloads</code> scope as an example:</p><pre><code class="language-Ruby">Download.total_downloads.map(&amp;:attributes) # =&gt; [{"total"=&gt;6175}]
# SELECT count(*) as total FROM "downloads"
</code></pre><p>The <code>rollup</code> scope groups data by a specific timeframe. Let’s start with one minute:</p><pre><code class="language-Ruby">Download.total_downloads.rollup("'1 min'").map(&amp;:attributes)
# SELECT time_bucket('1 min', ts) as ts, count(*) as total FROM "downloads" GROUP BY 1
=&gt; [{"ts"=&gt;2024-04-26 00:10:00 UTC, "total"=&gt;110},
 {"ts"=&gt;2024-04-26 00:11:00 UTC, "total"=&gt;1322},
 {"ts"=&gt;2024-04-26 00:12:00 UTC, "total"=&gt;1461},
 {"ts"=&gt;2024-04-26 00:13:00 UTC, "total"=&gt;1150},
 {"ts"=&gt;2024-04-26 00:14:00 UTC, "total"=&gt;1127},
 {"ts"=&gt;2024-04-26 00:15:00 UTC, "total"=&gt;1005}]
</code></pre><p>As you can see, the <code>time_bucket</code> function is introduced, and a <code>GROUP BY</code> clause is added.</p><p>If the query uses a component like <a href="https://docs.timescale.com/api/latest/hyperfunctions/financial-analysis/candlestick_agg/"><u>candlestick_agg</u></a>, it can call the <a href="https://docs.timescale.com/api/latest/hyperfunctions/financial-analysis/candlestick_agg/#rollup"><u>rollup</u></a> SQL function, which is where the method’s name comes from.</p><p>What if I want to sum the counters from the materialized view behind the scenes and roll up to a bigger timeframe? That’s where the aggregate classes come into play.</p><p>Continuous aggregates are hypertables. They’re materialized views that are periodically updated in the background according to their refresh policy. Every aggregation can be accessed and refreshed independently.</p><h3 id="aggregates-classes">Aggregates classes</h3><p>In the previous example, the rollup was done directly on the raw data. Now, let’s explore how the <code>continuous_aggregates</code> macro creates a class for each aggregated view in the database. These classes are nested inside the model and inherit from it, as they’re fully dependent on it.</p><p>So, to access the materialized data instead of building the query from raw data, nested classes are created with the <code>Model::ScopeNamePerTimeframe</code> naming convention.</p><pre><code class="language-Ruby">Download::TotalDownloadsPerMinute.all.map(&amp;:attributes)
# SELECT "total_downloads_per_minute".* FROM "total_downloads_per_minute"
=&gt; [{"ts"=&gt;2024-04-26 00:10:00 UTC, "total"=&gt;110},
 {"ts"=&gt;2024-04-26 00:11:00 UTC, "total"=&gt;1322},
 {"ts"=&gt;2024-04-26 00:12:00 UTC, "total"=&gt;1461},
 {"ts"=&gt;2024-04-26 00:13:00 UTC, "total"=&gt;1150},
 {"ts"=&gt;2024-04-26 00:14:00 UTC, "total"=&gt;1127},
 {"ts"=&gt;2024-04-26 00:15:00 UTC, "total"=&gt;1005}]
</code></pre><p>To roll up from the materialized data, we need to consider how the data was built. To get the counter, we count rows from the hypertable’s raw data, but for bigger timeframes, we can simply sum the already-computed counters. Here’s what it looks like if you need to roll up any scope into other timeframes:</p><pre><code class="language-Ruby">Download::TotalDownloadsPerMinute.select("sum(total) as total").rollup("'2 min'").map(&amp;:attributes)
# SELECT time_bucket('2 min', ts) as ts, sum(total) as total FROM "total_downloads_per_minute" GROUP BY 1
=&gt; [{"ts"=&gt;2024-04-26 00:12:00 UTC, "total"=&gt;2611}, {"ts"=&gt;2024-04-26 00:14:00 UTC, "total"=&gt;2132}, {"ts"=&gt;2024-04-26 00:10:00 UTC, "total"=&gt;1432}]
</code></pre><p>With the <code>rollup</code> scope, you can easily build custom scopes and regroup as needed. It supports a few common statistics scenarios, automatically detecting SQL statements that contain <code>count(*) as total</code> and transforming them into <code>sum(total) as total</code>. It can also take the min of mins or the max of maxes when rolling up into larger timeframes.</p><h3 id="refresh-aggregates">Refresh aggregates</h3><p>If you need to refresh all aggregates manually in the right order, you can use the <code>refresh_aggregates</code> method:</p><pre><code class="language-Ruby">Download.refresh_aggregates
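# Runs the refreshes in dependency order (minute first, month last),
# issuing one call per view, roughly (view names illustrative):
#   CALL refresh_continuous_aggregate('total_downloads_per_minute', NULL, NULL);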
</code></pre><h2 id="next-steps">Next steps</h2><p>That’s all, folks! I <a href="https://ideia.me/timescaledb-gem-continuous-aggregates-updates"><u>posted</u></a> a few more details in my blog during the development phase. If you have any questions or feedback, join the <a href="https://timescaledb.slack.com/archives/C04MQ3DKXEV/p1715632355486219"><u><code>#ruby</code></u></a> channel on the TimescaleDB Slack. Also, GitHub ⭐s for our <a href="https://github.com/timescale/timescaledb-ruby"><u>Ruby library</u></a> are very much welcome!</p><p>To give it a try and use the <code>continuous_aggregates</code> macro on your project, install the <a href="https://rubygems.org/gems/timescaledb" rel="noreferrer"><code>timescaledb</code></a> gem. Happy coding—but write fewer lines of code.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Self-Hosted or Cloud Database? A Countryside Reflection on Infrastructure Choices]]></title>
            <description><![CDATA[Read what country living can teach you about infrastructure choices and choosing a self-hosted vs. cloud database.]]></description>
            <link>https://www.tigerdata.com/blog/self-hosted-or-cloud-database-a-countryside-reflection-on-infrastructure-choices</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/self-hosted-or-cloud-database-a-countryside-reflection-on-infrastructure-choices</guid>
            <category><![CDATA[Cloud]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 03 Apr 2024 15:54:16 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2024/04/Self-hosted-vs-cloud-database_cover--1-.webp">
            </media:content>
<content:encoded><![CDATA[<p>The choice between using a cloud database or opting for a self-managed setup is critical for every developer, as it affects the entire framework through which an organization processes its data. Funny enough, nothing has taught me more valuable lessons about infrastructure than living in the countryside for the past four years. These experiences have shaped my mindset about what I prefer to manage myself and what I’d rather have as a service. </p><p>Much like my countryside life, this article weighs the benefits and drawbacks of managing one's infrastructure versus relying on external services, drawing parallels to guide you through your digital infrastructure decisions. </p><p>I’ll also compare deployment options, emphasizing that the implications of this choice extend far beyond the technical: they influence your organization's agility, efficiency, and long-term scalability. I hope these insights will act as a sort of compass, directing you toward a decision that aligns with your strategic objectives and operational capabilities.</p><h2 id="self-hosting-vs-cloud-a-water-management-lesson">Self-Hosting vs. Cloud: A Water Management Lesson</h2><p>Before explaining how my water system got me thinking about databases and infrastructure choices, let me clarify the decision on the table here. Self-hosting a database involves running it on your own physical or virtual servers, requiring maintenance, security, and scalability management. In contrast, a cloud database is hosted and managed by a third-party cloud provider, offering scalability, automated backups, and reduced maintenance overhead, allowing developers to focus more on application development and less on infrastructure management.</p><p>The thing is, <em>water is no different from data</em>. In businesses, data flows like water through systems, streaming seamlessly and needing to be stored. 
Water and data are both crucial for your system infrastructure—they are vital for the show to go on, irrespective of the challenges they bring.</p><h3 id="system-infrastructure-as-a-water-system">System infrastructure as a water system</h3><p>In the countryside, choosing my water system was similar to selecting a well-integrated infrastructure. System infrastructure refers to the underlying framework that supports the operation of software applications. The same could be said of my water system, which underpins everyday household operations.</p><p>In the driest seasons, I was compelled to build a resilient water recycling system. After heavy rains, my large reservoir would be brimming, providing essentials for showers, dishes, and laundry. However, managing this infrastructure wasn’t without its challenges. Annually, I’d face issues like broken pipes, the need to pump water from the lake, water pump failures, and clogged filters.</p><p>These experiences parallel the challenges in managing business infrastructure:</p><ul><li><strong>Unexpected breakdowns</strong>: just as pipes break, systems can fail.</li><li><strong>Resource scarcity</strong>: like running out of water, businesses can face resource shortages.</li><li><strong>Maintenance needs</strong>: like repairing a water pump, systems require regular upkeep.</li><li><strong>Regular updates</strong>: comparable to changing filters, systems need continual updates.</li></ul><h2 id="self-hosting-vs-cloud-services-making-a-choice">Self-Hosting vs. Cloud Services: Making a Choice</h2><p>Reflecting on my rural infrastructure, self-hosting was my only option. But what about businesses? 
Consider these questions:</p><ul><li>Do you have the infrastructure and skills to manage emergencies anytime?</li><li>Are you prepared to invest in and maintain your infrastructure?</li></ul><p>If the answer to either of these is “no,” self-hosting might only be a temporary solution.</p><h3 id="the-right-mindset-for-system-infrastructure">The right mindset for system infrastructure</h3><p>This is about more than opting to self-manage a database or use a cloud provider; it’s about understanding your business limitations and choosing the option that sustains your business longer.</p><p>To make an informed choice, here are some key considerations:</p><ul><li><strong>Learning costs during downtimes</strong>: Downtime is not just a technical setback—it's a period of intensive learning under pressure. Organizations must evaluate whether they have the resources and resilience to absorb the learning curve of diagnosing and resolving infrastructure failures in-house. The cost of this learning, both in terms of time and lost productivity, can be significant.</li><li><strong>Business risks during outages:</strong> Outages directly threaten your business continuity. The longer your systems are down, the greater the risk to your reputation, customer satisfaction, and revenue. Assessing the potential impact of outages is crucial in understanding whether the self-hosted approach aligns with your risk tolerance and business continuity plans.</li><li><strong>Team commitment to infrastructure responsibilities:</strong> Choosing to self-host means your team will bear the full weight of infrastructure responsibilities—from routine maintenance to emergency response. This commitment requires a dedicated, skilled team that's prepared to tackle challenges as they arise. 
Reflect on whether your team has the bandwidth and expertise to manage these tasks without detracting from their core functions.</li><li><strong>Training availability:</strong> Your team's effectiveness in managing a self-hosted infrastructure heavily relies on their ongoing education and training. Consider whether you have access to the necessary training resources to keep your team up-to-date with the latest technologies and best practices in infrastructure management.</li></ul><p>These considerations go beyond the surface-level appeal of having complete control over your infrastructure. They highlight the depth of <em>commitment</em> and <em>preparedness</em> needed to ensure that a self-hosted solution supports, rather than hinders, your organization's goals. </p><p>Below, I've outlined some of the primary areas of concern when opting for self-hosting, paired with the inevitable consequences businesses might face if they don’t prepare correctly:</p>
<!--kg-card-begin: html-->
<table style="border:none;border-collapse:collapse;"><colgroup><col width="147"><col width="463"></colgroup><tbody><tr style="height:35.5pt"><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:bottom;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Challenge Area</span></p></td><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:bottom;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Consequence</span></p></td></tr><tr style="height:106.75pt"><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Without a ready emergency response</span></p></td><td 
style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Eventually, a critical error will halt systems, leading to operational paralysis.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">The team will need to scramble to respond, undermining the stability of all operations.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" 
style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">There will be an expensive delay while figuring out the necessary steps to recovery.</span></p></li></ul></td></tr><tr style="height:124.75pt"><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Without investment and maintenance</span></p></td><td style="border-left:solid #e3e3e3 0.75pt;border-right:solid #e3e3e3 0.75pt;border-bottom:solid #e3e3e3 0.75pt;border-top:solid #e3e3e3 0.75pt;vertical-align:middle;background-color:#ffffff;padding:5pt 5pt 5pt 5pt;overflow:hidden;overflow-wrap:break-word;"><ul style="margin-top:0;margin-bottom:0;padding-inline-start:48px;"><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span 
style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Infrastructure will become overwhelmed as operations scale, leading to performance bottlenecks.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Bugs and system issues will proliferate as the infrastructure expands, reducing system reliability.</span></p></li><li dir="ltr" style="list-style-type:disc;font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;" aria-level="1"><p dir="ltr" style="line-height:1.2;margin-top:0pt;margin-bottom:0pt;" role="presentation"><span style="font-size:11pt;font-family:Arial,sans-serif;color:#000000;background-color:transparent;font-weight:400;font-style:normal;font-variant:normal;text-decoration:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Fragile infrastructure components will fail, precipitating emergencies and further destabilizing operations.</span></p></li></ul></td></tr></tbody></table>
<!--kg-card-end: html-->
<p>Looking at these complexities, the contrast with cloud services becomes clearer, illustrating the scalability, reliability, and reduced operational burden they can offer.</p><p>You can build a resilient team for self-hosting your database, but you’ll need sufficient resources and investment. Plus, you will have to be fully transparent with your customers to build their confidence. If self-hosting isn’t feasible, let your customers know or find ways to enhance your infrastructure together.&nbsp;</p><p>This is where Timescale’s self-hosting support options can lend you a helping hand. 🤝</p><h2 id="self-hosting-with-timescale">Self-Hosting With Timescale</h2><p>Self-hosting does not mean you have to do everything alone. If you take the self-hosting route for managing time-series data, Timescale is about empowering control: it backs users who want to self-host and control their infrastructure with comprehensive support to ensure their operations run smoothly. </p><p>To do this, Timescale provides specialized support packages, tailored to production and development environments, designed to mitigate the challenges of self-hosting.</p><h3 id="timescale-production-and-development-support-packages">Timescale Production and Development Support Packages</h3><p>For organizations committed to self-hosting their time-series databases, Timescale provides a <a href="https://timescale.ghost.io/blog/empowering-control-production-and-developer-support-for-self-managed-timescaledb/"><u>tiered support system designed to address the needs of both production and development stages</u></a>. 
This support includes:</p><ul><li><strong>All email Support requests are fielded within one business day</strong>, ensuring that any queries or issues are promptly addressed, minimizing delays in troubleshooting and resolution.</li><li><strong>24x7 on-call support with a one-hour response time for severe or critical issues</strong> that threaten production environments. Timescale offers dedicated on-call support to provide real-time expertise, significantly reducing downtime.</li><li><strong>Dedicated Support portal</strong>: A centralized location for all your support needs, providing easy access to assistance and resources.</li><li><strong>Production Support as a Service:</strong> This feature offloads the burden of emergency responses and infrastructure troubleshooting from your team, allowing you to focus on core operations while relying on Timescale's expertise.</li></ul><p>For more detailed information on how TimescaleDB can support your self-hosting requirements, visit <a href="https://www.timescale.com/self-managed-support"><u>Timescale's Self-Managed Support Page</u></a>.</p><h3 id="self-managed-timescaledb-features">Self-managed TimescaleDB features</h3><p>Besides providing support for your self-hosted database, TimescaleDB enhances PostgreSQL—one of the best-known reliable databases—with features specifically designed for time-series data, making it an attractive option for self-hosting scenarios:</p><ul><li><strong>Hypertables</strong>: These are designed to handle massive datasets by automatically partitioning data across time and space while still allowing you to interact with them as though they were standard PostgreSQL tables.</li><li><strong>Continuous Aggregates</strong>: Time-series queries often require aggregating data over time intervals. 
Continuous aggregates simplify this by automatically updating incrementally, saving processing time and resources.</li><li><strong>Compression</strong>: Leveraging <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar storage</a> and time-partitioned data structures, TimescaleDB offers efficient compression mechanisms to reduce storage costs and improve query performance.</li><li><strong>Full SQL</strong>: Unlike some NoSQL databases designed for time-series data, TimescaleDB does not compromise on the power of SQL, offering full compatibility with PostgreSQL for ease of use and flexibility.</li></ul><p>(To learn more about these features, <a href="https://docs.timescale.com/" rel="noreferrer">check out our documentation</a>.)</p><p>With Timescale, you can mitigate some of the traditional challenges associated with self-hosted databases, benefiting from a system that combines the scalability and flexibility of a conventional SQL database with the performance and efficiency required for modern <a href="https://www.tigerdata.com/learn/understanding-database-workloads-variable-bursty-and-uniform-patterns" rel="noreferrer">data workloads</a>. </p><p>But, even with the Timescale Support team by your side, managing your self-hosted setup remains a significant responsibility. It may be time to start considering an alternative: cloud services. The cloud technology era is not just upon us—it is shaping the future of data management, offering a distinct path from traditional self-hosting models. Shame I can't get a similar model for my water system.</p><p>Cloud services are specifically designed to expedite business operations and iterations. They represent an ideal solution for companies that prefer not to invest heavily in internal teams dedicated to ensuring resilience and managing infrastructure complexities. 
The suitability of cloud services for your organization hinges on several factors, including your business objectives, the current stage of your Service Level Agreements (SLAs), and your growth ambitions.</p><h2 id="the-role-of-cloud-services-in-scaling-and-infrastructure-management">The Role of Cloud Services in Scaling and Infrastructure Management</h2><p>Cloud services provide a reliable framework for scaling, enabling you to adapt quickly to changing demands without the upfront costs typically associated with physical infrastructure investments. Here are some key advantages:</p><ul><li><strong>Infrastructure investment at scale</strong>: Cloud services allow businesses to purchase infrastructure wholesale, translating to significant savings on time and personnel despite potentially higher direct spending.</li><li><strong>Security and reliability</strong>: By offloading security and reliability concerns to the cloud provider, companies can focus more on their core business functions.</li><li><strong>Transparency and control</strong>: While adopting cloud services may result in less operational transparency, the trade-off comes with access to a suite of services and support that can dramatically simplify infrastructure management.</li></ul><h2 id="timescale%E2%80%99s-cloud-services-empowering-your-data-management">Timescale’s Cloud Services: Empowering Your Data Management</h2><p>Timescale provides a comprehensive cloud solution designed to optimize time-series data management without the operational overhead of self-hosting:</p><ul><li><strong>Free Production Support</strong>: Ensures that your operations run smoothly with expert assistance readily available.</li><li><a href="https://docs.timescale.com/use-timescale/latest/data-tiering/tour-data-tiering/" rel="noreferrer"><strong>Data tiering</strong></a><strong> and </strong><a href="https://timescale.ghost.io/blog/savings-unlocked-why-we-switched-to-a-pay-for-what-you-store-database-storage-model/" 
rel="noreferrer"><strong>usage-based cost tiers</strong></a>: Optimizes your storage spending according to your actual needs, ensuring cost-efficiency.</li><li><strong>Scalability without the traditional constraints</strong>: With compute and storage decoupled, scalability becomes both cost-efficient and performance-optimized.</li><li><a href="https://timescale.ghost.io/blog/how-high-availability-works-in-our-cloud-database/" rel="noreferrer"><strong>High availability</strong></a><strong>, security, and compliance</strong>: Features like automated backups, upgrades, and end-to-end encryption ensure your data is secure, compliant, and available when needed.</li><li><strong>Insights and analytics</strong>: In-console metric visualization and <a href="https://timescale.ghost.io/blog/database-monitoring-and-query-optimization-introducing-insights-on-timescale/" rel="noreferrer">detailed query information</a> enhance your ability to monitor and improve performance.</li></ul><h2 id="the-verdict-self-hosting-vs-cloud-services">The Verdict: Self-Hosting vs. Cloud Services</h2><p>Choosing between self-hosting and cloud services boils down to a strategic decision based on your company's specific needs and goals:</p><p><strong>Self-hosting</strong> offers clarity, transparency, and guaranteed integration with your existing systems, giving you total control over your infrastructure.</p><p><strong>Cloud services</strong> streamline infrastructure investment and updates, removing the need for extensive personnel or tooling investments for security and emergencies, allowing you to concentrate on your core business.</p><p>Ultimately, these are not rigid rules but guiding principles to help you make informed decisions. Whether you opt for self-hosting or cloud services, <em>choose what aligns best with your business goals and needs</em>. 
In either scenario, Timescale provides tailored support and solutions to ensure your time-series data infrastructure is optimized, secure, and scalable.</p><p>Whether you're deciding between self-hosting and cloud services or looking for ways to optimize your current setup, the Timescale Slack Community <code>#tech-design</code> channel is an excellent resource for collective learning and support.</p><p>Join the <a href="https://slack.timescale.com"><u>Timescale Slack Community</u></a>, where you can talk to me and many other like-minded developers about their infrastructure choices. See you there! 👋</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Supercharge Your AI Agents With Postgres: An Experiment With OpenAI's GPT-4]]></title>
            <description><![CDATA[Read how you can work with AI agents as intermediaries between AI and databases.]]></description>
            <link>https://www.tigerdata.com/blog/supercharge-your-ai-agent-with-postgresql-an-experiment-with-openais-gpt-4</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/supercharge-your-ai-agent-with-postgresql-an-experiment-with-openais-gpt-4</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[OpenAI]]></category>
            <dc:creator><![CDATA[Jônatas Davi Paganini]]></dc:creator>
            <pubDate>Wed, 26 Jul 2023 13:00:22 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/07/AI-agents.png">
            </media:content>
            <content:encoded><![CDATA[<p>Hello developers, AI enthusiasts, and everyone eager to push the boundaries of what's possible with technology! Today, we're exploring <strong>AI agents</strong> as intermediaries in a fascinating intersection of fields: Artificial Intelligence and databases.</p><h2 id="the-dawn-of-ai-agents">The Dawn of AI Agents</h2><p>AI agents are at the heart of the tech industry's ongoing revolution. As programs capable of autonomous actions in their environment, AI agents analyze, make decisions, and execute actions that drive a myriad of applications. From autonomous vehicles and voice assistants to recommendation systems and customer service bots, AI agents are changing the way we interact with technology.<br></p><p>But what if we could take it a step further? What if we could use AI to simplify how we interact with databases? Could AI agents act as intermediaries, interpreting human language and converting it into structured database queries?</p><h2 id="a-ruby-experiment-with-gpt-4">A Ruby Experiment With GPT-4</h2><p>That's exactly what we tried to achieve in a recent experiment. Leveraging OpenAI's GPT-4, a powerful language model, we conducted an experiment to see how we could use AI to interact with our databases using everyday language.<br></p><p>The experiment was built using Ruby, and you can find the <a href="https://jonatas.github.io/timescaledb/chat_gpt_tutorial">detailed explanation and code here</a>. The results were fascinating, revealing the potential power of using AI as a “middle-man” (Middle-tech? Middle-bot?) between humans and databases.<br></p><p><a href="https://asciinema.org/a/594564">Check out the videos throughout this blog post</a> to see it in action:</p>
<!--kg-card-begin: html-->
<a href="https://asciinema.org/a/594563" target="_blank"><img src="https://asciinema.org/a/594563.svg" /></a>
<!--kg-card-end: html-->
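<p>To make the idea concrete, here is a minimal Ruby sketch of that agent loop, with the GPT-4 call stubbed out so it runs offline. The method names, prompt wording, and schema are illustrative assumptions, not the tutorial's actual code:</p>

```ruby
# Illustrative sketch of a natural-language-to-SQL agent loop.
# The model call is stubbed; a real agent would send the prompt
# to the OpenAI API and execute the returned SQL against Postgres.

SCHEMA = "conversations(id serial, role text, content text, created_at timestamptz)"

def build_prompt(question)
  <<~PROMPT
    You are a PostgreSQL assistant. Given the schema:
    #{SCHEMA}
    Translate the user's question into a single SQL query.
    Question: #{question}
  PROMPT
end

# Stand-in for a GPT-4 completion call (hypothetical; no network needed here).
def ask_model(prompt)
  "SELECT count(*) FROM conversations;"
end

sql = ask_model(build_prompt("How many messages have we stored?"))
puts sql
```

<p>In the real experiment, the stub is replaced by an actual OpenAI API call, and the generated SQL is run against the database, closing the loop between everyday language and structured queries.</p>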
<h2 id="why-store-data-for-ai-agents">Why Store Data for AI Agents?</h2><p>Data storage is crucial for the successful application of AI, particularly for training and fine-tuning models. By storing interactions, results, and other relevant data, we can improve the performance and accuracy of our AI agents over time.<br></p><p>But data storage is not just about improving our AI; it's also about cost-effectiveness. With the OpenAI API, you pay per token, which can add up when dealing with large amounts of data. By using PostgreSQL as long-term memory for your AI agent, you can reduce the number of tokens you send to the OpenAI API, saving computational resources and money.</p>
<!--kg-card-begin: html-->
<a href="https://asciinema.org/a/594564" target="_blank"><img src="https://asciinema.org/a/594564.svg" /></a>
<!--kg-card-end: html-->
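<p>One simple way to keep token costs down, sketched below in Ruby, is to persist the full conversation in PostgreSQL but send only the most recent exchanges with each API request. The in-memory array here stands in for rows fetched from a <code>conversations</code> table, and the window size is an illustrative assumption:</p>

```ruby
# Sketch of trimming context before calling the API: older messages stay
# in PostgreSQL (where they can be searched or summarized) instead of
# being resent and re-billed on every request.

MAX_CONTEXT_MESSAGES = 4 # illustrative limit

# Stand-in for rows fetched from a `conversations` table.
history = [
  { role: "user",      content: "What is TimescaleDB?" },
  { role: "assistant", content: "A time-series extension for PostgreSQL." },
  { role: "user",      content: "How do I create a hypertable?" },
  { role: "assistant", content: "Call create_hypertable on your table." },
  { role: "user",      content: "Can I compress old chunks?" },
  { role: "assistant", content: "Yes, with a compression policy." }
]

def context_window(history, limit)
  history.last(limit) # only these messages are sent (and paid for) per request
end

window = context_window(history, MAX_CONTEXT_MESSAGES)
puts window.size
```

<p>The full history remains queryable in PostgreSQL, so the agent loses no memory; it just stops paying, per token, to repeat itself.</p>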
<h2 id="postgresql-flexible-and-robust">PostgreSQL: Flexible and Robust</h2><p>PostgreSQL is a powerful, open-source relational database system. With a reputation for reliability, robustness, and performance, it's a fantastic choice for your AI's long-term memory. PostgreSQL also offers flexibility and scalability, making it suitable for projects of all sizes.<br></p><p>Whether you're conducting experiments or deploying production-ready applications, PostgreSQL's flexibility and robust nature make it an excellent companion for your AI.<br></p><p>Needless to say, we’re huge PostgreSQL enthusiasts here at Timescale—so much so that we built Timescale on PostgreSQL. Timescale works just like PostgreSQL under the hood, offering the same 100 percent SQL support (not SQL-like) and a rich ecosystem of connectors and tools but supercharging PostgreSQL for analytics, events, and <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a> (and time-series-like workloads). <br></p><p>With additional features like <a href="https://timescale.ghost.io/blog/compressing-immutable-data-changing-time-series-management/">compression</a> and <a href="https://timescale.ghost.io/blog/an-incremental-materialized-view-on-steroids-how-we-made-continuous-aggregates-even-better/">automatically updated incremental materialized views—we call them continuous aggregates</a>—Timescale allows you to scale PostgreSQL further for optimal performance while enjoying the best developer experience and cost-effectiveness. <br></p><p>But why all this talk about Timescale? Since each exchange between human and machine happens at a point in time, I realized I was dealing with time-series data. Cue TimescaleDB to the rescue!<br></p><h2 id="join-the-timescale-community">Join the Timescale Community</h2><p>We're just scratching the surface of what's possible when combining AI with databases like PostgreSQL, and we'd love for you to join us on this journey.</p><p>Got a cool idea? A question? Or just want to share your thoughts on this topic? Join the Timescale Community on <a href="https://timescale.com/community/">Slack</a> and head over to the <code>#ai-llm-discussion</code> channel. Let's push the boundaries together and shape the future of AI!<br><br>Check this page to learn how to <a href="http://timescale.com/ai">power agents, chatbots, and other large language model (LLM) applications with PostgreSQL</a>. To see what my fellow Timescalers Avthar, Mat, and Sam are already building, read their post on <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector</a>.<br></p><p>Remember, technology grows exponentially when great minds come together. See you there!</p>]]></content:encoded>
        </item>
    </channel>
</rss>