<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 20:13:03 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Building Multi-Tenant RAG Applications With PostgreSQL: Choosing the Right Approach]]></title>
            <description><![CDATA[Learn how to pick the right approach to handle multi-tenancy for your use case when building multi-tenant RAG applications with PostgreSQL.]]></description>
            <link>https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[RAG]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 11 Oct 2024 15:51:00 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/10/Building-Multi-Tenant-RAG-Applications-With-PostgreSQL-Choosing-the-Right-Approach-1.webp">
            </media:content>
            <content:encoded><![CDATA[<p>If you’re building a retrieval-augmented generation (RAG) app with PostgreSQL and pgvector, you’ll probably run into the problem of handling multi-tenancy. This article explains how to pick the right approach to handle multi-tenancy for your use case.</p><h2 id="what-is-multi-tenancy">What Is Multi-Tenancy?</h2><p>Multi-tenancy is like an apartment building for software. Just as one building houses multiple tenants (families or individuals), a multi-tenant application serves multiple customers or organizations using a single instance of the software.</p><p>Multi-tenancy serves multiple "tenants" independently and securely—thereby preventing accidental or unauthorized cross-referencing of private information between different users. This means designing a system that not only understands and retrieves information effectively but also strictly adheres to user-specific data boundaries.&nbsp;</p><h2 id="multi-tenant-applications-pros-and-cons">Multi-Tenant Applications: Pros and Cons&nbsp;</h2><p>Multi-tenancy delivers several key benefits for RAG applications:</p><ul><li><strong>Data isolation:</strong> This allows multiple tenants to use the same RAG system while keeping their data separate and secure. This is crucial for maintaining privacy and confidentiality.</li><li><strong>Scalability: </strong>A multi-tenant architecture enables the system to efficiently serve many users or organizations without needing separate deployments for each, leading to better resource utilization and cost-effectiveness.</li><li><strong>Customization: </strong>Different tenants often have unique needs. Multi-tenancy allows for customization of the RAG system for each tenant, such as using tenant-specific knowledge bases or fine-tuning models for particular domains.</li><li><strong>Compliance:</strong> In many industries, regulatory requirements mandate strict data separation. 
Multi-tenancy helps meet these compliance needs by ensuring data from different tenants doesn't mix.</li><li><strong>Efficient updates: </strong>With a multi-tenant system, updates and improvements can be rolled out to all tenants simultaneously, ensuring everyone benefits from the latest features and security patches.</li><li><strong>Cost-effectiveness: </strong>Sharing infrastructure and resources across tenants can significantly reduce operational costs compared to maintaining separate systems for each user or organization.</li><li><strong>Consistent performance:</strong> A well-designed multi-tenant RAG system can provide more consistent performance across all tenants, as resources are dynamically allocated based on usage.</li></ul><p>To build for long-term performance, flexibility, and efficiency, it’s also helpful to keep in mind the potential challenges of multi-tenant architectures:</p><ul><li>A multi-tenant environment allows multiple access points for users, which can increase the threat of a security breach.</li><li>Serving multiple clients in one instance of an application or database adds an extra level of complexity to the codebase and database maintenance.</li><li>Backup and restoration are more complex, so not all providers offer reliable restoration services.</li><li>The ability to offer tenant-specific customizations is limited, and balancing the shared codebase with unique tenant requirements is often necessary.</li><li>A technical problem on the provider’s end may affect all tenants simultaneously. 
This may apply to uptime, system upgrades, and other global processes.</li></ul><h2 id="postgresql-benefits-for-multi-tenant-rag-apps">PostgreSQL Benefits for Multi-Tenant RAG Apps</h2><p>PostgreSQL, enhanced with the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a> (the popular open-source extension for vector handling in PostgreSQL), offers a robust solution for implementing multi-tenant RAG apps. Its ability to efficiently store and <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a> embeddings alongside traditional data types makes it an ideal choice for organizations looking to leverage their existing infrastructure.&nbsp;</p><p>Here are the reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why PostgreSQL</a> is a good fit for multi-tenant RAG applications:</p><ol><li><strong>Built-in full-text search:</strong> PostgreSQL has robust full-text search capabilities, which are crucial for efficient retrieval in RAG systems.</li><li><strong>JSON support:</strong> PostgreSQL handles JSON data natively, allowing flexible storage of documents and metadata.</li><li><strong>Vector extensions:</strong> extensions like pgvector enable vector similarity searches, essential for embedding-based retrieval.</li><li><strong>Row-level security: </strong>this feature allows fine-grained access control, crucial for multi-tenant setups to ensure data isolation.</li><li><strong>Scalability:</strong> PostgreSQL can handle large datasets and concurrent users, important for growing multi-tenant applications.</li><li><strong>ACID compliance:</strong> <a href="https://www.timescale.com/learn/understanding-acid-compliance" rel="noreferrer">atomicity, consistency, isolation, and durability (ACID) compliance</a> ensures data integrity and consistency across transactions.</li><li><strong>Extensibility: </strong>custom functions 
and extensions can be added to tailor the database to specific RAG needs.</li><li><strong>Cost-effective:</strong> as an open-source solution, it can be more cost-effective than some cloud-based alternatives.</li></ol><p>Using PostgreSQL for multi-tenant RAG applications also gives you the advantage of <a href="https://cerebralvalley.ai/blog/timescale-is-making-postgresql-better-for-ai-1tiUqSzGsSn76ORZVfMwOk"><u>Timescale Cloud’s stack of open-source extensions</u></a> to easily build and scale RAG, search, and agent applications. In addition to pgvector, this stack includes <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> (which builds on pgvector for enhanced performance and scale) and <a href="https://www.timescale.com/ai"><u>pgai</u></a> (which brings embedding creation and large language model completions to the database, giving more <a href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>PostgreSQL developers the skills of AI engineers</u></a>). Both extensions complement pgvector and rely on its capabilities.</p><h3 id="implementing-rag-with-postgresql">Implementing RAG with PostgreSQL</h3><p>Implementing RAG with PostgreSQL involves a multi-step process that leverages the database's vector storage capabilities. The workflow typically includes ingesting and chunking data, converting text into <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> using an embedding model, and storing these vectors in PostgreSQL using pgvector.&nbsp;</p><p>When a user query is received, the system retrieves the most relevant data from the vector database based on similarity search. This retrieved information is then combined with the user's question and any additional context to create a comprehensive prompt for the large language model (LLM). 
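</p><p>Because retrieval runs in SQL, tenant isolation can be enforced at exactly this step. Here is a minimal, illustrative sketch using row-level security; the <code>documents</code> table, the <code>app.tenant_id</code> setting, and the embedding dimension are hypothetical names chosen for the example:</p><pre><code class="language-postgresql">-- hypothetical schema: documents(id, tenant_id, content, embedding vector(1536))
alter table documents enable row level security;

-- each session identifies its tenant, e.g.:
--   select set_config('app.tenant_id', '42', false);
create policy tenant_isolation on documents
    using (tenant_id = current_setting('app.tenant_id', true)::bigint);

-- with the policy active, similarity search only sees the current tenant's rows
select content
from documents
order by embedding &lt;=&gt; '[0.1, 0.2, ...]'::vector  -- placeholder query embedding
limit 5;
</code></pre><p>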
The LLM processes this enriched prompt and generates a response, which is then returned to the user, providing a more accurate and contextually relevant answer.</p><h2 id="strategies-for-dealing-with-multi-tenancy-for-rag-in-postgresql">Strategies for Dealing With Multi-Tenancy for RAG in PostgreSQL</h2><p>To pick the right strategy for your multi-tenant RAG application with PostgreSQL, consider your requirements (and your users’ or customers’ requirements) for shared resources, data separation, customization, scalability, and of course, costs.&nbsp;</p><p>PostgreSQL offers four levels of multi-tenancy implementation—table, schema, logical database, and database service—each suitable for distinct use case scenarios and each with its pros and cons. Here’s a comparative overview of each level.&nbsp;&nbsp;</p><ol><li><strong>Table-level multi-tenancy</strong> gives each tenant its own table. This is simple but may lead to data isolation concerns. Table-level isolation works well for simple apps with shared data.</li><li><strong>Schema-level separation </strong>gives each tenant its own schema in the same logical database. It offers better isolation with minimal operational and cost overhead, balancing isolation and efficiency for most use cases.</li><li><strong>Logical database separation</strong> gives each tenant its own logical database in a database instance. Separation at the logical database level provides stronger separation but increases complexity. Logical databases suit clients needing stricter separation.</li><li><strong>Database service separation </strong>gives each tenant their own database service. Service-level separation is ideal for high-security scenarios or clients with unique needs. 
It offers the highest isolation but at the cost of increased resources and management overhead.</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg" class="kg-image" alt="&nbsp;Overview of approaches for handling multi-tenancy in PostgreSQL" loading="lazy" width="1654" height="931" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1600w, https://timescale.ghost.io/blog/content/images/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1654w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">&nbsp;Overview of approaches for handling multi-tenancy in PostgreSQL</span></figcaption></figure><h2 id="conclusion">Conclusion</h2><p>By carefully considering the optimal use cases, pros, and cons of each multi-tenancy approach and aligning them with your application's needs, you can create a scalable, secure, and performant RAG system in PostgreSQL. As RAG technologies continue to evolve, PostgreSQL's extensibility and strong community support ensure that it will remain an adaptable platform for building sophisticated multi-tenant AI applications.&nbsp;</p><p>Additionally, PostgreSQL on Timescale Cloud allows you to store your relational, <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, events, semi-structured, <em>and</em> vector data in one place. This removes the operational complexity of managing a separate vector database. 
It can deliver <a href="https://timescale.ghost.io/blog/pgvector-vs-pinecone/"><u>performance, rich capabilities, and user experience</u></a> equal to or better than a specialized tool. <br><br><a href="https://console.cloud.timescale.com/signup" rel="noreferrer">Create a free account to try Timescale Cloud's open-source AI stack today</a> (including pgvector, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a>, and pgvectorscale).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai]]></title>
            <description><![CDATA[Enterprise-ready LLMs from Cohere, now available in the pgai PostgreSQL extension.]]></description>
            <link>https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 09 Aug 2024 18:01:55 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/08/Cohere_pgai.png">
            </media:content>
            <content:encoded><![CDATA[<p><em>Enterprise-ready LLMs from Cohere—now available in the pgai PostgreSQL extension.</em></p><p><a href="https://cohere.com" rel="noreferrer"><strong>Cohere</strong></a> is a leading generative AI company in the field of large language models (LLMs) and retrieval-augmented generation (RAG) systems. What sets Cohere apart from other model developers is its unwavering focus on enterprise needs and support for multiple languages.&nbsp;</p><h3 id="cohere-models">Cohere models</h3><p>Cohere has quickly gained traction among businesses and developers alike, thanks to its suite of models that cater to key steps of the AI application-building process: text embedding, result reranking, and reasoning. Here’s an overview of Cohere models:</p><ul><li><a href="https://cohere.com/embed"><strong><u>Cohere Embed</u></strong></a><strong>:</strong> A leading text representation model supporting over 100 languages, with the latest version (Embed v3) capable of evaluating document quality and relevance to queries. Cohere Embed is ideal for <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>, retrieval-augmented generation (RAG), clustering, and classification tasks.</li><li><a href="https://cohere.com/rerank"><strong><u>Cohere Rerank</u></strong></a><strong>:</strong> A model that significantly improves search quality for any keyword or vector search system with minimal code changes. It’s optimized for high throughput and reduced compute requirements while leveraging Cohere's embedding performance for accurate reranking. Cohere rerank is system agnostic, meaning it can be used with any vector search system, including PostgreSQL.</li><li><a href="https://cohere.com/command"><strong><u>Cohere Command</u></strong></a>: A family of highly scalable language models optimized for enterprise use. Command supports RAG, multi-language use, tool use, and citations. 
Command R+ is the most advanced model, as it’s optimized for conversational interaction and long-context tasks. Command R is well suited for simpler RAG and single-step tool use tasks and is the most cost-effective choice out of the Command family.&nbsp;</li></ul><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🎉</div><div class="kg-callout-text">Today, we’re thrilled to announce that PostgreSQL developers can now harness the power of Cohere's enterprise-grade language models directly on PostgreSQL data using the pgai PostgreSQL extension.&nbsp;</div></div><h3 id="what-is-pgai">What is pgai?</h3><p><a href="https://github.com/timescale/pgai"><u>Pgai</u></a> is an open-source <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that brings AI models closer to your data, simplifying tasks such as embedding creation, text classification, semantic search, and retrieval-augmented generation on data stored in PostgreSQL.</p><p><strong>Pgai supports the entire suite of Cohere Embed, Rerank, and Command models.</strong> This means that developers can now:</p><ul><li>Create embeddings using Cohere’s Embed model for data inside PostgreSQL tables, all without having to pipe data out of and back into the database.</li><li>Build hybrid search systems for higher quality results in search and RAG applications using Cohere Rerank, combining vector search in pgvector and PostgreSQL full-text search.</li><li>Perform tasks like classification, summarization, <a href="https://www.tigerdata.com/blog/automating-data-enrichment-in-postgresql-with-openai" rel="noreferrer">data enrichment</a>, and other reasoning tasks on data in PostgreSQL tables using Cohere Command models.</li><li>Build highly accurate RAG systems completely in SQL, leveraging the Cohere Embed, Rerank, and Command models together.</li></ul><p>We built pgai to <a 
href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>give more PostgreSQL developers AI Engineering superpowers</u></a>. Pgai makes it easier for database developers familiar with PostgreSQL to become "AI engineers" by providing familiar SQL-based interfaces to AI functionality like embedding creation and model usage.</p><p>While Command is the flagship model for building scalable, production-ready AI applications, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> supports a variety of models in the Cohere lineup. This includes specialized embedding models that support over 100 languages, enabling developers to select the most suitable model for their specific enterprise use case. </p><p>Whether it's building internal AI knowledge assistants, security documentation AI, customer feedback analysis, or employee support systems, Cohere's models integrated with pgai and PostgreSQL offer powerful solutions.</p><p><strong>Getting started with Cohere models in pgai</strong></p><p>Ready to start building with Cohere's suite of powerful language models and PostgreSQL? Pgai is open source under the PostgreSQL License and is available for immediate use in your AI projects. You can find installation instructions on the pgai <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>GitHub repository</u></a>. You can also access pgai (alongside pgvector and <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a>) on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s Cloud PostgreSQL platform</u></a>. 
If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.</p><p>Once connected to your database, create the pgai extension by running:</p><pre><code class="language-postgresql">CREATE EXTENSION IF NOT EXISTS ai CASCADE;</code></pre><p><strong>Join the Postgres AI Community</strong></p><p>Have questions about using Cohere models with pgai? Join the <a href="https://discord.gg/KRdHVXAmkp?ref=timescale.com"><u>Postgres for AI Discord</u></a>,&nbsp; where you can share your projects, seek help, and collaborate with a community of peers.&nbsp; You can also <a href="https://github.com/timescale/pgai/issues?ref=timescale.com"><u>open an issue on the pgai GitHub</u></a> (and while you’re there, stars are always appreciated ⭐).</p><p>Next, let's explore the benefits of using Cohere models for building enterprise AI applications. And we’ll close with a real-world example of using Cohere models to build a hybrid search system, combining pgvector semantic search with PostgreSQL full-text search.&nbsp;</p><p>Let's dive in...</p><h2 id="why-use-cohere-models-for-rag-and-search-applications">Why Use Cohere Models for RAG and Search Applications?</h2><p>Cohere's suite of models offers several advantages for enterprise-grade RAG and search applications:</p><p><strong>High-quality results</strong>: Cohere models excel in understanding business language, providing relevant content, and tackling complex enterprise challenges.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcxGJoF6tCFIRf-wLKvkqUKCRp7S7WekDS6qQuySntrZp68pBzl8VxVeIjelY54kcjpJd9wmBvBS9TjMdWkXAeceM6V40w42zsR7IbWlZCT1OOyePr0wehPrJCa0DxilDhcrbM4iXDMElXNxl8NYymFew?key=6JbOGbzT9OJGF_xgkuNFdg" class="kg-image" alt="" loading="lazy" width="1600" height="704"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Impact of Cohere 
Rerank:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> Semi-structured retrieval accuracy based on Recall@5 on TMDB-5k-Movies, WikiSQL, nq-tables, and Cohere annotated datasets (higher is better). Source: </em></i><a href="https://cohere.com/blog/rerank-3"><u><i><em class="italic underline" style="white-space: pre-wrap;">Cohere blog</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">.</em></i></figcaption></figure><p><strong>Multi-language support:</strong> With support for over 100 languages, Cohere models are ideal for global enterprises. For example, Command R+ is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic.&nbsp; Additionally, pre-training data has been included for the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.</p><p><strong>Flexible deployment:</strong> Cohere’s flexible deployment options allow businesses to bring Cohere's models to their own data, either on their servers or in private/commercial clouds, addressing critical security and compliance concerns. 
This flexibility, combined with Cohere's robust API and partnerships with major cloud providers like Amazon Web Services (through the Bedrock AI platform), ensures seamless integration and improved scalability for enterprise applications.</p><p><strong>Enterprise focus:</strong> Strategic partnerships with industry leaders like Fujitsu, Oracle, and McKinsey &amp; Company underscore Cohere's enterprise-centric approach.</p><p>Cohere models enable sophisticated enterprise AI applications such as:</p><ul><li>Investment research assistants</li><li>Support chatbots and Intelligent support copilots</li><li>Executive AI assistants</li><li>Document summarization tools</li><li>Knowledge and project staffing assistants</li><li>Regulatory compliance monitoring</li><li>Sentiment analysis for brand management</li><li>Research and development assistance</li></ul><p>Cohere's models offer a powerful, scalable, and easily implementable solution for enterprise search and RAG applications, balancing high accuracy with efficient performance.</p><h2 id="tutorial-build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai">Tutorial: Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcDydTpU8JCBmeEgJZFg_NMrdqAUeWGAoM8bfvt_iSVgWHcaDZNYx-ePIOV3i1nLg2ht_NBAaMJd02A6gD4Ao2xtZ12AbzYzBa_jD2La5VaRJmdn6DlWEF7VzIBP_jI32CEdMhMo47uv41RXM1IzT-xyZhH?key=6JbOGbzT9OJGF_xgkuNFdg" class="kg-image" alt="" loading="lazy" width="1600" height="1048"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Process diagram showing how pgai combines full text, semantic search</em></i><span style="white-space: pre-wrap;">,</span><i><em class="italic" style="white-space: pre-wrap;"> and Cohere’s models to find the most relevant results to a user query using hybrid search.</em></i></figcaption></figure><p>Let’s look at an example of why Cohere's support in pgai is such a 
game-changer. We’ll perform a hybrid search over a corpus of news articles, combining PostgreSQL full-text search with pgvector’s semantic search, leveraging Cohere Embed and Rerank models in the process.</p><p>Here's what we'll cover:</p><ul><li><strong>Setting up your environment: </strong>Creating a Python virtual environment, installing necessary libraries, and setting up a PostgreSQL database with the pgai extension.</li><li><strong>Preparing your data: </strong>Creating a table to store news articles and loading sample data from the CNN/Daily Mail news dataset.</li><li><strong>Generating vector embeddings using pgai:</strong> Using Cohere's Embed model to create vector representations of text.</li><li><strong>Implementing full text and semantic search capabilities:</strong> Setting up PostgreSQL's full-text search for keyword matching and creating a vector search index for semantic search.</li><li><strong>Performing searches:</strong> Executing keyword searches using PostgreSQL's full-text search and semantic searches on embeddings using pgvector <a href="https://www.tigerdata.com/learn/hnsw-vs-diskann" rel="noreferrer">HNSW</a>.</li><li><strong>Performing hybrid search: </strong>Combining keyword and semantic search results by using Cohere's Rerank model to improve result relevance.</li></ul><p>By the end of this tutorial, you'll have a powerful hybrid search system that provides highly relevant search results, leveraging both traditional keyword search and semantic search techniques.</p><h3 id="setting-up-your-environment">Setting up your environment</h3><p>To begin, we'll set up a Python virtual environment and install the necessary libraries. This ensures a clean, isolated environment for our project.</p><pre><code class="language-bash">mkdir cohere
cd cohere
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`

pip install datasets "psycopg[binary]" python-dotenv
</code></pre><p><br>This setup:</p><ul><li>Creates a new directory for our project</li><li>Sets up a Python virtual environment</li><li>Activates the virtual environment</li><li>Installs the required Python packages:<ul><li><code>datasets</code>: For loading our sample dataset</li><li><code>psycopg</code>: PostgreSQL adapter for Python</li><li><code>python-dotenv</code>: For managing environment variables</li></ul></li></ul><p>Next, create a <code>.env</code> file in your project directory to store your database connection string and Cohere API key:</p><pre><code class="language-bash">DB_URL=postgres://username:password@localhost:5432/your_database
COHERE_API_KEY=your_cohere_api_key_here
</code></pre><p>Replace the placeholders with your actual database credentials and Cohere API key.</p><h3 id="setting-up-your-postgresql-database">Setting up your PostgreSQL database</h3><p>Now, let's set up our PostgreSQL database with the necessary extensions and table structure. We’ll create a table to hold our news articles. The <code>embedding</code> column will hold our embeddings which we’ll create using the Cohere Embed model. The <code>tsv</code> column will hold our full-text-search vectors.</p><p>Connect to your PostgreSQL database and run the following SQL commands:</p><pre><code class="language-postgresql">-- create the pgai extension
create extension if not exists ai cascade;

-- create a table for the news articles
-- the embedding column will store embeddings from cohere
-- the tsv column will store the full-text-search vector
create table cnn_daily_mail
( id bigint not null primary key generated by default as identity
, highlights text
, article text
, embedding vector(1024)
, tsv tsvector generated always as (to_tsvector('english', article)) stored
);

-- index the full-text-search vector
create index on cnn_daily_mail using gin (tsv);
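
-- illustrative addition (not part of the setup itself): once data is loaded,
-- the gin index supports keyword searches such as:
--   select id, highlights from cnn_daily_mail
--   where tsv @@ plainto_tsquery('english', 'climate change')
--   limit 5;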
</code></pre><p>This SQL script:</p><ul><li>Creates the <code>pgai</code> extension</li><li>Sets up a table to store our news articles, including columns for the article text, embeddings, and a full-text <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a></li><li>Creates an index on the full-text search vector for efficient keyword searches</li></ul><h3 id="load-the-dataset">Load the dataset</h3><p>Now that our database is set up, we'll load sample data from the CNN/Daily Mail dataset. We'll use Python to download the dataset and efficiently insert it into our PostgreSQL table using the <code>copy</code> command with the <code>binary</code> format.</p><p>Create a new file called <code>load_data.py</code> in your project directory and add the following code:</p><pre><code class="language-python">import os
from dotenv import load_dotenv
from datasets import load_dataset
import psycopg


# Load environment variables
load_dotenv()
DB_URL = os.environ["DB_URL"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]  # loaded to fail fast if missing; this script itself never calls Cohere

# Load and prepare the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
test = dataset["test"]
test = test.shuffle(seed=42).select(range(0,1000))

# Insert data into PostgreSQL
with psycopg.connect(DB_URL) as con:
    with con.cursor(binary=True) as cur:
        with cur.copy("copy public.cnn_daily_mail (highlights, article) from stdin (format binary)") as cpy:
            cpy.set_types(['text', 'text'])
            for row in test:
                cpy.write_row((row["highlights"], row["article"]))

print("Data loading complete!")
</code></pre><p>This script:</p><ol><li>Loads environment variables from the .env file</li><li>Downloads the CNN/Daily Mail dataset</li><li>Selects a random subset of 1,000 articles from the test set</li><li>Efficiently inserts the data into our PostgreSQL table using the binary COPY command</li></ol><p>To run the script and load the data:</p><pre><code class="language-bash">python load_data.py</code></pre><p>This approach is much faster than inserting rows one by one, especially for larger datasets. After running the script, you should have 1000 news articles loaded into your cnn_daily_mail table, ready for the next steps in our tutorial.</p><p><strong>Note</strong>: Before running this script, make sure your .env file is properly set up with the correct database URL.</p><h3 id="create-vector-embeddings-with-cohere-embed-and-pgai"><br>Create vector embeddings with Cohere Embed and pgai</h3><p>In this step, we'll use Cohere's Embed model to generate <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> for our news articles. These embeddings will enable semantic search capabilities.</p><p>First, set your Cohere API key in your PostgreSQL session:</p><pre><code class="language-postgresql">select set_config('ai.cohere_api_key', '&lt;YOUR-API-KEY&gt;', false) is not null as set_cohere_api_key</code></pre><p>The pgai extension will use this key by default for the duration of your database session.</p><p>We'll explore two methods to generate embeddings: a simple approach and a more production-ready solution. Choose method 1 if you’re just experimenting and method 2 if you want to see how you’d run this in production.<br></p><p><strong>Method 1: Easy mode (for small datasets)</strong></p><p>This is the "easy-mode" approach to generating embeddings for each news article.&nbsp;</p><pre><code class="language-postgresql">-- this is the easy way to create the embeddings
-- but it's one LONG-running update statement
update cnn_daily_mail set embedding = ai.cohere_embed('embed-english-v3.0', article, input_type=&gt;'search_document');
</code></pre><p>This method is straightforward, but each call to Cohere is relatively slow, so this single UPDATE statement can run for a long time on larger datasets. And because it is a single statement, a failure partway through rolls back all of the embeddings generated so far.</p><p><br><strong>Method 2: Production mode (recommended for larger datasets)</strong></p><p>This method processes rows one at a time, allowing for better concurrency and error handling. We select and lock only a single row at a time, and we commit our work as we go along, so completed embeddings are preserved even if a later row fails. (Note that COMMIT inside a DO block requires PostgreSQL 11 or later and cannot be used inside an explicit transaction block.)</p><pre><code class="language-postgresql">-- this is a more production-appropriate way to generate the embeddings
-- we only lock one row at a time and we commit each row immediately
do $$
declare
    _id bigint;
    _article text;
    _embedding vector(1024);
begin
    loop
        select id, article into _id, _article
        from cnn_daily_mail
        where embedding is null
        for update skip locked
        limit 1;

        if not found then
            exit;
        end if;

        _embedding = ai.cohere_embed('embed-english-v3.0', _article, input_type=&gt;'search_document');
        update cnn_daily_mail set embedding = _embedding where id = _id;
        commit;
    end loop;
end;
$$;
</code></pre><p>Note that both methods above use the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed"><u>cohere_embed() function</u></a> provided by the pgai extension to generate embeddings for each article in the cnn_daily_mail table using the Cohere <a href="https://docs.cohere.com/docs/cohere-embed" rel="noreferrer">Embed v3 embedding model</a>.</p><p>After running one of the above methods, verify that all rows have embeddings:</p><pre><code class="language-postgresql">-- this should return 0
select count(*) from cnn_daily_mail where embedding is null;
</code></pre><p><strong>Create a vector search index</strong></p><p>To optimize vector similarity searches, create an index on the embedding column:</p><pre><code class="language-postgresql">-- index the embeddings
create index on cnn_daily_mail using hnsw (embedding vector_cosine_ops);
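
-- optional (sketch): HNSW parameters you can tune; the values shown are
-- pgvector's defaults, and higher values improve recall at the cost of speed
-- create index on cnn_daily_mail using hnsw (embedding vector_cosine_ops) with (m = 16, ef_construction = 64);
-- set hnsw.ef_search = 40;  -- search-time candidate list size, set per session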
</code></pre><p>This index uses the pgvector <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">HNSW</a> (<a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">hierarchical navigable small world</a>) algorithm, which is efficient for nearest-neighbor searches in high-dimensional spaces.</p><p>With these steps completed, your database is now set up for both keyword-based and semantic searches. In the next section, we'll explore how to perform these searches using <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> and Cohere's models.</p><h2 id="perform-a-keyword-search-in-postgresql"><br>Perform a Keyword Search in PostgreSQL</h2><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">Note to reader: Unfortunately, the news dataset is quite depressing, so please forgive the negativity in the keywords and prompts below. All terms are chosen purely for illustrative and educational purposes.</div></div><p>PostgreSQL's full-text search capabilities allow us to perform complex keyword searches efficiently. We'll use the <code>tsv</code> column we created earlier, which contains the tsvector representation of each article.</p><pre><code class="language-postgresql">-- a full-text-search
-- must include "death" OR "kill"
-- must include "police" AND "car" AND "dog"
select article
from cnn_daily_mail
where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
;
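
-- variant (sketch): rank the matches by relevance with ts_rank instead of
-- returning them in arbitrary order (assumes the tsv column created earlier)
-- select article, ts_rank(tsv, to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')) as rank
-- from cnn_daily_mail
-- where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
-- order by rank desc
-- limit 10;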
</code></pre><p>Let's break down this query:</p><ul><li><code>@@</code> is the text search match operator in PostgreSQL.</li><li><code>to_tsquery('english', '...')</code> converts our search terms into a tsquery type.</li><li>The search logic is: (death OR kill) AND police AND car AND dog.</li></ul><p>This query will return articles that contain:</p><ol><li>Either "death" or "kill" (or their variations)</li><li>AND "police" (or variations)</li><li>AND "car" (or variations)</li><li>AND "dog" (or variations)</li></ol><p>This method of searching is fast and efficient, especially for exact keyword matches inside the article text. However, it doesn't capture semantic meaning or handle synonyms well. In the next section, we'll explore how to perform semantic searches using the embeddings we created earlier.</p><h2 id="perform-a-semantic-search-in-postgresql-with-pgvector-pgai-and-cohere-embed">Perform a Semantic Search in PostgreSQL With pgvector, pgai, and Cohere Embed </h2><p>Semantic search allows us to find relevant articles based on the meaning of a query rather than just matching keywords. We'll use Cohere's Embed model to convert our search query into a <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a>, then find the most similar article embeddings using pgvector.</p><pre><code class="language-postgresql">-- get the 15 most relevant stories to our question
with q as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as q
)
select article
from cnn_daily_mail
order by embedding &lt;=&gt; (select q from q limit 1)
limit 15
;
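
-- variant (sketch): also return the cosine distance so you can inspect how
-- relevant each match is (a smaller distance means more similar)
-- with q as (select ai.cohere_embed('embed-english-v3.0', 'Show me stories about police reports of deadly happenings involving cars and dogs.', input_type=&gt;'search_query') as q)
-- select article, embedding &lt;=&gt; (select q from q) as cosine_distance
-- from cnn_daily_mail
-- order by cosine_distance
-- limit 15;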
</code></pre><p>Let's break this query down:</p><ol><li>We use a <a href="https://www.timescale.com/learn/how-to-use-common-table-expression-sql" rel="noreferrer">CTE (Common Table Expression) </a>named <code>q</code> to generate the embedding for our search query using Cohere's model (also named <code>q</code>).</li><li><code>ai.cohere_embed()</code> is a function provided by pgai that interfaces with Cohere's API to create embeddings. See more in the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed" rel="noreferrer">pgai docs</a>.</li><li>We specify <code>input_type =&gt; 'search_query'</code> to optimize the embedding for search queries.</li><li>In the main query, we order the results by the cosine distance (<code>&lt;=&gt;</code>) between each article's embedding and our query embedding.</li><li>The <code>LIMIT 15</code> clause returns the top 15 most semantically similar articles.</li></ol><p>This semantic search can find relevant articles even if they don't contain the exact words used in the query. It understands the context and meaning of the search query and matches it with semantically similar content.</p><h2 id="performing-hybrid-search-using-pgai-and-cohere-rerank">Performing Hybrid Search Using Pgai and Cohere Rerank&nbsp;</h2><p>Hybrid search combines the strengths of both keyword and semantic searches and then uses Cohere's Rerank model to improve the relevance of the results further. This approach can provide more accurate and contextually relevant results than either method alone.</p><p>Here's how to perform a hybrid search with reranking:</p><pre><code class="language-postgresql">with full_text_search as
(
   select article
   from cnn_daily_mail
   where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
   limit 15
)
, vector_query as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as query_embedding
)
, vector_search as
(
   select article
   from cnn_daily_mail
   order by embedding &lt;=&gt; (select query_embedding from vector_query limit 1)
   limit 15
)
, combined_results as
(
   select article
   from full_text_search
   union
   select article
   from vector_search
)
, reranked_results as
(
   select ai.cohere_rerank
   ( 'rerank-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , (select jsonb_agg(combined_results.article) from combined_results)
   , top_n =&gt; 5
   , return_documents =&gt; true
   ) as response
)
select
  x.index
, x.document-&gt;&gt;'text' as article
, x.relevance_score
from reranked_results
cross join lateral jsonb_to_recordset(reranked_results.response-&gt;'results') x(document jsonb, index int, relevance_score float8)
order by relevance_score desc
;
</code></pre><p>Let's break down this query:</p><ol><li><code>full_text_search</code>: Performs a keyword search using PostgreSQL's full-text search capabilities.</li><li><code>vector_query</code>: Creates an embedding for our search query using Cohere's Embed model.</li><li><code>vector_search</code>: Performs a semantic search using the query embedding.</li><li><code>combined_results</code>: Combines the results from both keyword and semantic searches.</li><li><code>reranked_results</code>: Uses Cohere's Rerank model to reorder the combined results based on relevance to the query.</li><li>The final <code>SELECT</code> statement extracts and formats the reranked results.</li></ol><p>This hybrid approach offers several advantages:</p><ul><li>It captures both exact keyword matches and semantically similar content.</li><li>The reranking step helps to prioritize the most relevant results.</li><li>It can handle cases where either keyword or semantic search alone might miss relevant articles.</li></ul><p>To use this query effectively:</p><ol><li>Adjust the <code>to_tsquery</code> parameters to match your specific keyword search needs.</li><li>Modify the natural language query in both the <code>ai.cohere_embed</code> and <code>ai.cohere_rerank</code> functions to match your search intent.</li><li>Experiment with the <code>LIMIT</code> values and the <code>top_n</code> parameter to balance between recall and precision.</li></ol><p>Remember, while this approach can provide highly relevant results, it does involve multiple API calls to Cohere (for embedding and reranking), which may impact performance for large result sets or high-volume applications.&nbsp;</p><p>This hybrid search method demonstrates the power of combining traditional database search techniques with advanced AI models, all within your PostgreSQL database using pgai.</p><h2 id="get-started-with-cohere-and-pgai-today">Get Started With Cohere and Pgai Today</h2><p>The integration of Cohere’s Command, Embed, 
and Rerank models into pgai marks a significant milestone in our vision to help PostgreSQL evolve into an <a href="https://timescale.ghost.io/blog/making-postgresql-a-better-ai-database/"><u>AI database</u></a>. By bringing these state-of-the-art language models directly into your database environment, you can unlock new levels of efficiency, intelligence, and innovation in your projects.</p><p>Pgai is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions on the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repository</u></a>. You can also access pgai on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s cloud PostgreSQL platform</u></a>. If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.<br></p><p>Pgai is an effort to enrich the PostgreSQL ecosystem for AI. If you’d like to help, here’s how you can get involved:</p><ul><li>Got questions about using Cohere models in pgai? Join the <a href="https://discord.gg/Q8TajvAPRN?ref=timescale.com"><u>Postgres for AI Discord</u></a>, a community of developers building AI applications with PostgreSQL. Share what you’re working on, and help or get helped by a community of peers.</li><li>Share the news with your friends and colleagues: Share our posts about pgai on <a href="https://x.com/TimescaleDB?ref=timescale.com"><u>X/Twitter</u></a>, <a href="https://www.linkedin.com/company/timescaledb/?ref=timescale.com"><u>LinkedIn</u></a>, and Threads. We promise to RT back.</li><li>Submit issues and feature requests: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve pgai. 
Head over to the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repo</u></a> to share your ideas.</li><li>Make a contribution: We welcome community contributions for pgai. Pgai is written in Python and PL/Python. Let us know which models you want to see supported, particularly for open-source embedding and generation models. See the <a href="https://github.com/timescale/pgai/blob/main/CONTRIBUTING.md?ref=timescale.com"><u>pgai GitHub for instructions to contribute</u></a>.</li><li>Offer the pgai extension on your PostgreSQL cloud: Pgai is an open-source project under the <a href="https://github.com/timescale/pgvectorscale/blob/main/LICENSE?ref=timescale.com"><u>PostgreSQL License</u></a>. We encourage you to offer pgai on your managed PostgreSQL database-as-a-service platform and can even help you spread the word. Get in touch via our <a href="https://www.timescale.com/contact?ref=timescale.com"><u>Contact Us form</u></a> and mention pgai to discuss further.</li></ul><p>We're excited to see what you'll build with PostgreSQL and pgai!</p><p><br></p><p><br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nearest Neighbor Indexes: What Are IVFFlat Indexes in Pgvector and How Do They Work]]></title>
            <description><![CDATA[A primer on the pgvector’s Inverted File Flat (ivfflat) algorithm for approximate nearest neighbor search. ]]></description>
            <link>https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 30 Jun 2023 13:03:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor_hero_4.png">
            </media:content>
<content:encoded><![CDATA[<p>The rising popularity of ChatGPT, OpenAI, and applications of Large Language Models (LLMs) has brought the concept of approximate nearest neighbor search (ANN) to the forefront and sparked a renewed interest in vector databases due to the use of embeddings. <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings</a> are mathematical representations of phrases that capture the semantic meaning as a vector of numerical values. </p><p>What makes this representation fascinating—and useful—is that phrases with similar meanings will have similar vector representations, meaning the distance between their respective vectors will be small. We recently discussed one application of these embeddings, <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">retrieval-augmented generation</a>—augmenting base LLMs with knowledge they weren’t trained on—but there are numerous other applications as well.</p><h3 id="semantic-similarity-search">Semantic similarity search</h3><p>One common application of embeddings is precisely <a href="https://en.wikipedia.org/wiki/Semantic_search">semantic similarity search</a>. The basic concept behind this approach is that if I have a knowledge library consisting of various phrases and I receive a question from a user, I can locate the most relevant information in my library by finding the data that is most similar to the user's query. <br></p><p>This is in contrast to lexical or full-text search, which matches only the literal words of the query.
The remarkable aspect of this technique is that, since the embeddings represent the semantics of the phrase rather than its specific wording, I can find pertinent information even if it is expressed using completely different words!<br></p><h3 id="the-challenge-of-speed-at-scale">The challenge of speed at scale</h3><p>Semantic similarity search involves calculating an embedding for the user's question and then searching through my library to find the K most relevant items related to that question—these are the K items whose embeddings are closest to that of the question. However, when dealing with a large library, it becomes crucial to perform this search efficiently and swiftly. In the realm of vector databases, this problem is referred to as "Finding the k nearest neighbors" (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">KNN</a>).<br></p><p>This post discusses a method to enhance the speed of this search when utilizing PostgreSQL and <a href="https://www.timescale.com/ai">pgvector</a> for storing <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a>: the <a href="https://github.com/pgvector/pgvector#indexing">Inverted File Flat (IVFFlat)</a> algorithm for approximate nearest neighbor search. We’ll cover why IVFFlat is useful, how it works, and best practices for using it in pgvector for fast similarity search over embeddings vectors. </p><p>Let’s go!</p><p><strong>P.S. 
</strong>If you’re looking for the fastest vector search index on PostgreSQL, <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer"><u>check out pgvectorscale</u></a>.</p><h2 id="what-are-ivfflat-indexes">What Are IVFFlat Indexes?</h2><p>IVFFlat indexes, short for Inverted File with Flat Compression, are a type of vector index used in PostgreSQL's <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">pgvector extension</a> to speed up similarity searches to find vectors that are close to a given query. This index type uses approximate nearest neighbor search (ANNS) to provide fast searches.&nbsp;&nbsp;</p><p>These indexes work by dividing the vectors into multiple lists, known as clusters. Each cluster represents a region of similar vectors, and an inverted index is built to map each region to its corresponding vectors. When a query comes in, the nearest clusters to the query are identified and only the vectors in those clusters are searched. Thus, this approach significantly reduces the scope of similarity searches by excluding all the vectors that are not in the clusters that are close to the query.</p><p></p><h2 id="why-use-the-ivfflat-index-in-pgvector">Why Use the IVFFlat Index in Pgvector</h2><p>Searching for the k-nearest neighbors is not a novel problem for PostgreSQL. <a href="https://docs.timescale.com/use-timescale/latest/extensions/postgis/">PostGIS</a>, a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> for handling location data, stores its data points as two-dimensional vectors (longitude and latitude). Locating nearby locations is a crucial query in that domain. <br></p><p>PostGIS tackles this challenge by employing an index known as an R-Tree, which yields precise results for k-nearest neighbor queries. 
Similar techniques, such as KD-Trees and Ball Trees, are also employed for this type of search in other databases.<br></p><h3 id="the-curse-of-dimensionality">"The curse of dimensionality"</h3><p>However, there's a catch. These approaches cease to be effective when dealing with data larger than approximately 10 dimensions due to the "curse of dimensionality." Cue the ominous music! Essentially, as you add more dimensions, the available space increases exponentially, resulting in exponentially sparser data. This reduced density renders existing indexing techniques, like the aforementioned R-Tree, KD-Trees, and Ball Trees, which rely on partitioning the space, ineffective. (To learn more, I suggest these two videos: <a href="https://www.youtube.com/watch?v=BbYV8UfMJSA">1</a>, <a href="https://www.youtube.com/watch?v=E1_WCdUAtyE">2</a>). <br></p><p>Given that embeddings often consist of more than a thousand dimensions—OpenAI’s are 1,536—new techniques had to be developed. There are no known exact algorithms for efficiently searching in such high-dimensional spaces. Nevertheless, there are excellent <em>approximate</em> algorithms that fall into the category of approximate nearest neighbor algorithms. 
Numerous such algorithms exist, but in this article, we will delve into the Inverted File Flat or IVFFlat algorithm, which is provided by pgvector.</p><h2 id="how-the-ivfflat-index-works-in-pgvector">How the IVFFlat Index Works in pgvector</h2><p></p><h3 id="how-ivfflat-divides-the-space">How IVFFlat divides the space</h3><p>To gain an intuitive understanding of how IVFFlat works, let's consider a set of vectors represented in a two-dimensional space as the following points:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png" class="kg-image" alt="A set of vectors represented as points in two dimensions" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A set of vectors represented as points in two dimensions</em></i></figcaption></figure><p>In the IVFFlat algorithm, the first step involves applying k-means clustering to the vectors to find cluster centroids. In the case of the given vectors, let's assume we perform k-means clustering and identify four clusters with the following centroids.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png" class="kg-image" alt="After k-means clustering, we identify four clusters indicated by the colored triangles." 
loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">After k-means clustering, we identify four clusters indicated by the colored triangles</em></i></figcaption></figure><p>After computing the centroids, the next step is to assign each vector to its nearest centroid. This is accomplished by calculating the distance between the vector and each centroid and selecting the centroid with the smallest distance as the closest one. This process conceptually maps each point in space to the closest centroid based on proximity.<br></p><p>By establishing this mapping, the space becomes divided into distinct regions surrounding each centroid (technically, this kind of division is called a <a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi Diagram</a>). Each region represents a cluster of vectors that exhibit similar characteristics or are close in semantic meaning. 
<br></p><p>This division enables efficient organization and retrieval of approximate nearest neighbors during subsequent search operations, as vectors within the same region are likely to be more similar to each other than those in different regions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png" class="kg-image" alt=" The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid</span></figcaption></figure><h3 id="building-the-ivfflat-index-in-pgvector"><br>Building the IVFFlat index in pgvector</h3><p>IVFFlat proceeds to create an <a href="https://en.wikipedia.org/wiki/Inverted_index">inverted index</a> that maps each centroid to the set of vectors within the corresponding region. In pseudocode, the index can be represented as follows:</p><pre><code>inverted_index = {
  centroid_1: [vector_1, vector_2, ...],
  centroid_2: [vector_3, vector_4, ...],
  centroid_3: [vector_5, vector_6, ...],
  ...
}
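
# search sketch: find the centroid closest to the query vector, then
# scan only that centroid's list for the k nearest vectors
closest = argmin(distance(query, centroid) for centroid in inverted_index)
neighbors = k_smallest(inverted_index[closest], k, key=distance_to(query))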
</code></pre>
<p>Here, each centroid serves as a key in the inverted index, and the corresponding value is a list of vectors that belong to the region associated with that centroid. This index structure allows for efficient retrieval of vectors in a region when performing similarity searches.</p><h3 id="searching-the-ivfflat-index-in-pgvector">Searching the IVFFlat index in pgvector</h3><p>Let's imagine we have a query for the nearest neighbors to a vector represented by a question mark, as shown below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png" class="kg-image" alt="We want to find nearest neighbors to the vector represented by the question mark" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We want to find nearest neighbors to the vector represented by the question mark</span></figcaption></figure><p>To find the approximate nearest neighbors using IVFFlat, the algorithm operates under the assumption that the nearest vectors will be located in the same region as the query vector. 
Based on this assumption, IVFFlat employs the following steps:</p><ol><li>Calculate the distance between the query vector (red question mark) and each centroid in the index.</li><li>Select the centroid with the smallest distance as the closest centroid to the query (the blue centroid in this example).</li><li>Retrieve the vectors associated with the region corresponding to the closest centroid from the inverted index.</li><li>Compute the distances between the query vector and each of the vectors in the retrieved set.</li><li>Select the K vectors with the smallest distances as the approximate nearest neighbors to the query.<br></li></ol><p>The use of the index in IVFFlat accelerates the search process by restricting the search to the region associated with the closest centroid. This results in a significant reduction in the number of vectors that need to be examined during the search. Specifically, if we have C clusters (centroids), on average, we can reduce the number of vectors to search by a factor of 1/C.</p><h3 id="searching-at-the-edge">Searching at the edge</h3><p>The assumption that the nearest vectors will be found in the same region as the query vector can introduce recall errors in IVFFlat. 
Consider the following query:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png" class="kg-image" alt=" ivfflat can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">IVFFlat</span><i><em class="italic" style="white-space: pre-wrap;"> can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space</em></i></figcaption></figure><p>From visual inspection, it becomes apparent that one of the light-blue vectors is closer to the query vector than any of the dark-blue vectors, despite the query vector falling within the dark-blue region. This illustrates a potential error in assuming that the nearest vectors will always be found within the same region as the query vector.<br></p><p>To mitigate this type of error, one approach is to search not only the region of the closest centroid but also the regions of the next closest R centroids. This approach expands the search scope and improves the chances of finding the true nearest neighbors. 
<br></p><p>In pgvector, this functionality is implemented through the <code>probes</code> parameter, which specifies the number of centroids to consider during the search, as described below.</p><h2 id="parameters-for-pgvector%E2%80%99s-ivfflat-implementation"><br>Parameters for Pgvector’s IVFFlat Implementation</h2><p>In the implementation of IVFFlat in pgvector, two key parameters are exposed: <code>lists</code> and <code>probes</code>.</p><h3 id="lists-parameter-in-pgvector">Lists parameter in pgvector</h3><p>The <code>lists</code> parameter determines the number of clusters created during index building (it’s called <code>lists</code> because each centroid has a list of vectors in its region). Increasing this parameter reduces the number of vectors in each list and results in smaller regions.<br></p><p>It offers the following trade-offs to consider:</p><ul><li>A higher <code>lists</code> value speeds up queries by reducing the search space during query time.</li><li>However, it also decreases the region size, which can lead to more recall errors by excluding some points.</li><li>Additionally, more distance comparisons are required to find the closest centroid during step one of the query process.<br></li></ul><p>Here are some recommendations for setting the <code>lists</code> parameter:</p><ul><li>For datasets with fewer than one million rows, use <code>lists = rows / 1000</code>.</li><li>For datasets with more than one million rows, use <code>lists = sqrt(rows)</code>.</li><li>It is generally advisable to have at least 10 clusters.</li></ul><h3 id="probes-parameter-in-pgvector">Probes parameter in pgvector</h3><p>The <code>probes</code> parameter determines the number of regions to consider at query time. By default, only the region corresponding to the closest centroid is searched. By increasing <code>probes</code>, more regions can be searched to improve recall at the cost of query speed. 
<br></p><p>The recommended value for the <code>probes</code> parameter is <code>probes = sqrt(lists)</code>.</p><h2 id="using-ivfflat-in-pgvector">Using IVFFlat in Pgvector</h2><h3 id="creating-an-index">Creating an index</h3><p>When creating an index, it is advisable to have existing data in the table already, since k-means uses that data to derive the cluster centroids.<br></p><p>The index in pgvector offers three different methods to calculate the distance between vectors: L2 (Euclidean), negative inner product, and cosine. It is essential to select the same method for both index creation and query operations. The following table shows the query operators and their corresponding index methods:</p>
<!--kg-card-begin: html-->
<table><thead><tr><th>Distance type</th><th>Query operator</th><th>Index method</th></tr></thead><tbody><tr><td>L2 / Euclidean</td><td><code>&lt;-&gt;</code></td><td><code>vector_l2_ops</code></td></tr><tr><td>Negative inner product</td><td><code>&lt;#&gt;</code></td><td><code>vector_ip_ops</code></td></tr><tr><td>Cosine</td><td><code>&lt;=&gt;</code></td><td><code>vector_cosine_ops</code></td></tr></tbody></table>
<!--kg-card-end: html-->
<p><strong>Note</strong>: OpenAI <a href="https://platform.openai.com/docs/guides/embeddings/limitations-risks">recommends</a> cosine distance for its embeddings.</p><p>To create an index in pgvector using IVFFlat, use a statement of the following form:</p><pre><code class="language-SQL">CREATE INDEX ON &lt;table name&gt; USING ivfflat (&lt;column name&gt; &lt;index method&gt;) WITH (lists = &lt;lists parameter&gt;);
</code></pre>
<p>Replace <code>&lt;table name&gt;</code> with the name of your table and <code>&lt;column name&gt;</code> with the name of the column that contains the vector type.</p><p>For example, if our table is named <code>embeddings</code> and our embedding vectors are in a column named <code>embedding</code>, we can create an IVFFlat index as follows:</p><pre><code class="language-SQL">CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
</code></pre>
<p>Here’s a simple Python function that you can use to create an IVFFlat index with the <code>lists</code> value chosen according to the recommendations above (the query-time <code>probes</code> setting is covered below):</p><pre><code class="language-Python">import math

def create_ivfflat_index(conn, table_name, column_name, query_operator="&lt;=&gt;"):
    # Map the query operator to its matching index method
    if query_operator == "&lt;-&gt;":
        index_method = "vector_l2_ops"
    elif query_operator == "&lt;#&gt;":
        index_method = "vector_ip_ops"
    elif query_operator == "&lt;=&gt;":
        index_method = "vector_cosine_ops"
    else:
        raise ValueError(f"unrecognized operator {query_operator}")

    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) AS cnt FROM {table_name};")
        num_records = cur.fetchone()[0]

        # lists = rows / 1000 (minimum 10) up to one million rows,
        # lists = sqrt(rows) beyond that
        if num_records &gt; 1000000:
            num_lists = int(math.sqrt(num_records))
        else:
            num_lists = max(num_records // 1000, 10)

        cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')
        conn.commit()
</code></pre>
<h3 id="querying">Querying</h3><p>The index can be used whenever a query has an <code>ORDER BY</code> of the form <code>column &lt;query operator&gt; &lt;some pseudo-constant vector&gt;</code> along with a <code>LIMIT k</code>.<br></p><p><strong>Some examples</strong><br><br>Get the closest two vectors to a constant vector:</p><pre><code class="language-SQL">SELECT * FROM my_table ORDER BY embedding_column &lt;=&gt; '[1,2]' LIMIT 2;
</code></pre>
<p>This is a common usage pattern in retrieval-augmented generation (RAG) using LLMs, where we find the embedding vectors that are closest in semantic meaning to the user’s query. In that case, the constant vector would be the embedding vector representing the user’s query.</p><p>You can see an example of this in our guide to <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">creating, storing, and querying OpenAI embeddings with pgvector</a>, where we use this Python function to find the three most similar documents to a given user query from our database:</p><pre><code class="language-Python">import numpy as np
from pgvector.psycopg2 import register_vector

# Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension so vector values round-trip correctly
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs
</code></pre>
<p>Get the closest vectors to the vector stored in a given row (excluding that row itself):</p><pre><code class="language-SQL">SELECT * FROM my_table WHERE id != 1 ORDER BY embedding_column &lt;=&gt; (SELECT embedding_column FROM my_table WHERE id = 1) LIMIT 2;
</code></pre>
<p><strong>Tip:</strong> PostgreSQL's ability to use an index does not guarantee that it will actually be used! The cost-based planner evaluates candidate plans and may decide that a sequential scan or a different index is cheaper for a given query. Use the <code>EXPLAIN</code> command to see the chosen execution plan. To test whether the index can be used at all, you can adjust planner cost parameters until you get the desired plan; on small datasets, <code>SET enable_seqscan = off</code> is especially handy because it discourages sequential scans.<br></p><p>To adjust the <code>probes</code> parameter, set the <code>ivfflat.probes</code> variable. For instance, to set it to 5, execute the following statement before running the query:</p><pre><code class="language-sql">SET ivfflat.probes = 5;
</code></pre>
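<p>Putting the earlier tuning heuristics together, here is a small sketch (the rules are this post's recommendations, not pgvector defaults, and the function name is ours) that derives both <code>lists</code> and <code>probes</code> from the row count:</p>

```python
import math

# Sketch: derive IVFFlat tuning values from the table's row count, using
# the heuristics recommended in this post (not pgvector defaults):
#   lists  = rows / 1000 (minimum 10) up to one million rows,
#   lists  = sqrt(rows) beyond that,
#   probes = sqrt(lists).
def recommended_ivfflat_params(num_rows: int) -> tuple[int, int]:
    if num_rows > 1_000_000:
        lists = int(math.sqrt(num_rows))
    else:
        lists = max(num_rows // 1000, 10)
    probes = max(int(math.sqrt(lists)), 1)
    return lists, probes

print(recommended_ivfflat_params(500_000))   # (500, 22)
```

<p>The returned <code>lists</code> value goes into <code>CREATE INDEX ... WITH (lists = ...)</code>, and <code>probes</code> into <code>SET ivfflat.probes = ...</code> before querying.</p>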
<h3 id="dealing-with-data-changes">Dealing with data changes</h3><p>As your data evolves with inserts, updates, and deletes, the IVFFlat index will be updated accordingly. New vectors will be added to the index, while no longer-used vectors will be removed. </p><p><strong>However, the clustering centroids will not be updated</strong>. Over time, this can result in a situation where the initial clustering, established during index creation, no longer accurately represents the data. This can be visualized as follows:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png" class="kg-image" alt="As data gets inserted or deleted from the index, if the index is not rebuilt, the ivfflat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---6.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---6.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">As data gets inserted or deleted from the index, if the index is not rebuilt, the IVFFlat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well</span></figcaption></figure><p>To address this issue, the only solution is to rebuild the index.<br></p><p>Here are two important takeaways from this issue:</p><ul><li>Build the index once you have all the representative data you want to reference in it. 
This is unlike most indexes, which can be built on an empty table.</li><li>It is advisable to rebuild the index periodically.<br></li></ul><p>When rebuilding the index, it is highly recommended to use the <code>CONCURRENTLY</code> option to avoid interfering with ongoing operations.<br></p><p>Thus, to rebuild the index, run the following (for example, from a cron job):</p><pre><code class="language-SQL">REINDEX INDEX CONCURRENTLY &lt;index name&gt;;
</code></pre>
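<p>As a concrete sketch of driving this from Python: the index name <code>embeddings_embedding_idx</code> below is an assumption (look up the real name in <code>pg_indexes</code>), and note that <code>REINDEX ... CONCURRENTLY</code> cannot run inside a transaction block, so autocommit must be enabled:</p>

```python
# Sketch of a periodic IVFFlat rebuild. The helper only builds the SQL
# statement; running it requires a live psycopg2 connection (commented
# out below) with autocommit on, because REINDEX CONCURRENTLY cannot
# run inside a transaction block.
def reindex_statement(index_name: str) -> str:
    # Quote the identifier; assumes a simple, trusted index name.
    return f'REINDEX INDEX CONCURRENTLY "{index_name}";'

# Usage with a live connection:
# conn.autocommit = True
# with conn.cursor() as cur:
#     cur.execute(reindex_statement("embeddings_embedding_idx"))
```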
<h2 id="summing-it-up">Summing It Up</h2><p>The IVFFlat algorithm in pgvector provides an efficient solution for approximate nearest neighbor search over high-dimensional data like embeddings. It works by clustering similar vectors into regions and building an inverted index that maps each region to its vectors. This allows queries to focus on a subset of the data, enabling fast search. By tuning the <code>lists</code> and <code>probes</code> parameters, IVFFlat can balance speed and accuracy for a dataset.</p><p>Overall, IVFFlat gives PostgreSQL the ability to perform fast semantic similarity search over complex data. With simple queries, applications can find the nearest neighbors to a query vector among millions of high-dimensional vectors. For natural language processing, information retrieval, and more, IVFFlat is a compelling solution. By understanding how IVFFlat divides the vector space into regions and builds its inverted index, you can optimize its performance for your needs and build powerful applications on top of it.</p><p>✨<strong>Resources for further learning:</strong> Now that you know more about the IVFFlat index in pgvector, here are some resources to further your learning journey:&nbsp;</p><ul><li>Learn about other PostgreSQL indexes for vector search, like <a href="https://timescale.ghost.io/blog/vector-database-basics-hnsw/" rel="noreferrer">HNSW</a>.</li><li>Learn how we made <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer">PostgreSQL as fast as Pinecone for vector data</a>.</li><li>Follow our tutorial on <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/" rel="noreferrer">creating, storing, and querying OpenAI embeddings using PostgreSQL as a vector database</a>.</li><li><a href="https://timescale.ghost.io/blog/how-to-build-llm-applications-with-pgvector-vector-store-in-langchain/" rel="noreferrer">Learn how to use pgvector as a vector store in LangChain</a>.</li><li><a href="https://timescale.ghost.io/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/" rel="noreferrer">See how you can refine vector search queries with time filters in pgvector using a single SQL query</a>.</li></ul><p>And if you’re looking for a production-ready PostgreSQL database for your AI application’s vector, relational, and time-series data, <a href="https://www.timescale.com/ai" rel="noreferrer"><u>try Timescale Cloud</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL as a Vector Database: A Pgvector Tutorial]]></title>
            <description><![CDATA[Vector databases add organizational intelligence to AI. Learn how to use PostgreSQL as a vector database for retrieval-augmented generation with pgvector.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[OpenAI]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Wed, 21 Jun 2023 18:22:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/06/Postgres-vector-database-and-OpenAI-embeddings-blog--1-.png">
            </media:content>
            <content:encoded><![CDATA[<p>Vector databases enable efficient storage and searching of vector data. They are essential for developing and maintaining AI applications using large language models (LLMs).</p><p>With some help from the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a>, you can leverage PostgreSQL as a vector database to store and query<a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"> <u>OpenAI embeddings</u></a>. OpenAI embeddings are a type of data representation (in the shape of vectors, i.e., lists of numbers) used to measure the similarity of text strings for OpenAI’s models.</p><p>In this article, we work through the example of creating a chatbot to answer questions about Tiger Data (creators of TimescaleDB). The chatbot will be trained on content from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/"><u>Tiger Data Developer Q&amp;A blog posts</u></a>. 
This example will illustrate the key concepts for creating, storing, and querying OpenAI embeddings with PostgreSQL and pgvector.</p><p>This example has three parts:</p><ul><li>Part 1: How to create embeddings from content using the<a href="https://platform.openai.com/docs/api-reference?ref=timescale.com"> <u>OpenAI API</u></a>.</li><li>Part 2: How to use PostgreSQL as a vector database and store OpenAI embedding vectors using pgvector.</li><li>Part 3: How to use embeddings retrieved from a vector database to augment LLM generation.</li></ul><p>One could think of this as a “hello world” tutorial for building a chatbot that can reference a company knowledge base or developer docs.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Jupyter Notebook and Code:</strong></b> You can find all the code used in this tutorial in a Jupyter Notebook, as well as sample content and embeddings on the Tiger Data GitHub: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">timescale/vector-cookbook</a>. We recommend cloning the repo and following along by executing the code cells as you read through the tutorial.</div></div><h2 id="the-big-picture-openai-embeddings">The Big Picture: OpenAI Embeddings</h2><p>Foundational models of AI (e.g., GPT-3 or GPT-4) may be missing some information needed to provide accurate answers to certain specific questions. That’s because relevant information was not in the dataset used to train the model. (For example, the information is stored in private documents or only became available recently.) 
This lack of data may make these models unsuitable for use as a chatbot in specific information banks.</p><p><a href="https://www.promptingguide.ai/techniques/rag?ref=timescale.com"><u>Retrieval-augmented generation</u></a> (RAG) gives a simple solution; it provides additional context to the foundational model in the prompt. </p><p>This technique is powerful—it allows you to “teach” foundational models about things only you know about and use that to create a ChatGPT++ experience for your users!</p><p>But what context should you provide to the model? If you have a library of information, how can you determine what’s relevant to a given question? That is what <a href="https://www.timescale.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">embeddings</a> are for. <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"><u>OpenAI embeddings</u></a> are a mathematical representation of the semantic meaning of a piece of text that allows for <em>similarity search</em>.</p><p>With this representation, when you get a user question and calculate its embedding, you can use a similarity search against data embeddings in your library to find the most relevant information. But that requires having an embedding representation of your library.&nbsp;&nbsp;</p><h3 id="what-is-a-vector-database">What is a vector database?</h3><p>A <a href="https://www.timescale.com/blog/how-to-choose-a-vector-database"><u>vector database</u></a> is a database that can handle vector data. Vector databases are useful for:</p><ul><li><a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><strong>Semantic search</strong></a><strong>:</strong> Vector databases facilitate semantic search, which considers the context or meaning of search terms rather than just exact matches. 
They are useful for recommendation systems, content discovery, and question-answering systems.</li><li><strong>Efficient similarity search:</strong> Vector databases are designed for efficient high-dimensional nearest neighbor search, a task where traditional relational databases struggle.</li><li><strong>Machine learning:</strong> Vector databases store and search embeddings created by machine-learning models. This feature aids in finding items semantically similar to a given item.</li><li><strong>Multimedia data handling:</strong> Vector databases also excel in working with multimedia data (images, audio, video) by converting them into high-dimensional vectors for efficient similarity search.</li><li><strong>NLP and data combination:</strong> In natural language processing (NLP), vector databases store high-dimensional vectors representing words, sentences, or documents. They also allow a combination of traditional SQL queries with similarity searches, accommodating both structured and unstructured data.</li></ul><p>We’ll use PostgreSQL with the <a href="https://github.com/pgvector/pgvector"><u>pgvector extension</u></a> installed as our vector database. Pgvector extends PostgreSQL to handle vector data types and vector similarity search, like <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search"><u>nearest neighbor search</u></a>, which we’ll use to find the k most related embeddings in our database for a given user prompt.</p><h2 id="using-pgvector-for-a-postgresql-vector-database">Using Pgvector for a PostgreSQL Vector Database</h2><p><a href="https://www.timescale.com/learn/using-pgvector-with-python"><u>Pgvector</u></a> is an open-source extension for PostgreSQL that enables storing and searching over machine learning-generated embeddings. It provides different capabilities that allow users to identify exact and approximate nearest neighbors. 
Pgvector is designed to work seamlessly with other PostgreSQL features, including indexing and querying.</p><p>Now we’re ready to start building our chatbot!</p><h3 id="why-use-pgvector-as-a-vector-database">Why use pgvector as a vector database?</h3><p>Here are five reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why <u>PostgreSQL</u></a> is a good choice for storing and handling vector data:</p><ul><li><strong>Integrated solution:</strong> By using PostgreSQL as a vector database, you keep your data in one place. This can simplify your architecture by reducing the need for multiple databases or additional services.</li><li><strong>Enterprise-level robustness and operations:</strong> With a 30-year pedigree, PostgreSQL provides world-class data integrity, operations, and robustness. This includes backups, streaming replication, role-based and row-level security, and ACID compliance.</li><li><strong>Full-featured SQL:</strong> PostgreSQL supports a rich set of SQL features, including joins, subqueries, window functions, and more. This allows for powerful and complex queries that can include both traditional relational data and vector data. It also integrates with a plethora of existing data science and data analysis tools.</li><li><strong>Scalability and performance:</strong> PostgreSQL is known for its robustness and ability to handle large datasets. Using it as a vector database allows you to leverage these characteristics for vector data as well.</li><li><strong>Open source:</strong> PostgreSQL is open source, which means it's free to download and use, and you can modify it to suit your needs. It also means that it benefits from the collective input of developers all over the world, which often results in high-quality, secure, and up-to-date software. PostgreSQL has a large and active community, so help is readily available. 
There are many resources, including documentation, tutorials, forums, and more, to help you troubleshoot and optimize your PostgreSQL database.</li></ul><h2 id="setting">Setting Up</h2><ul><li>Install Python.</li><li>Install and configure a Python virtual environment. We recommend <a href="https://github.com/pyenv/pyenv">Pyenv</a>.</li><li>Install the requirements for this notebook using the following command:</li></ul><pre><code class="language-Python">pip install -r requirements.txt
</code></pre>
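<p>For reference, the imports used in this tutorial imply a requirements file along these lines (a sketch inferred from the imports below; the exact, pinned contents live in the companion <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">vector-cookbook</a> repo):</p>

```text
openai
pandas
numpy
tiktoken
psycopg2-binary  # or psycopg2, if building from source
pgvector
python-dotenv
```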
<p>Import all the packages we will be using:</p><pre><code class="language-Python">import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
</code></pre>
<p>You’ll need to <a href="https://platform.openai.com/overview">sign up for an OpenAI developer account</a> and create an OpenAI API key. We recommend getting a paid account to avoid rate limiting, and setting a spending cap so that you avoid any surprises with bills.</p><p>Once you have an OpenAI API key, it’s a <a href="https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety">best practice</a> to store it as an environment variable and then have your Python program read it.</p><pre><code class="language-Python"># First, run: export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY
# (or put the same line, without "export", in a local .env file)

# Get the OpenAI API key from the environment / local .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']
</code></pre>
<h2 id="part-1-create-embeddings-for-your-postgresql-vector-database">Part 1: Create Embeddings for Your PostgreSQL Vector Database</h2><p><a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">Embeddings</a> measure how related text strings are. First, we'll create embeddings using the OpenAI API on some text we want the LLM to answer questions on.</p><p>In this example, we'll use content from the Tiger Data blog, specifically from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/">Developer Q&amp;A section</a>, which features posts by Tiger Data users talking about their real-world use cases.</p><p>You can replace this blog data with any text you want to embed, such as your own company blog, developer documentation, internal knowledge base, or any other information you’d like to have a “ChatGPT-like” experience over.</p><pre><code class="language-Python"># Load your CSV file into a pandas DataFrame
df = pd.read_csv('blog_posts_data.csv')
df.head()
</code></pre>
<p>The output looks like this:</p>
<!--kg-card-begin: html-->
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky">Title</th>
    <th class="tg-0pky">Content</th>
    <th class="tg-0lax">URL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">0</td>
    <td class="tg-0pky">How to Build a Weather Station With Elixir, Ne...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-to-build-a-...</td>
  </tr>
  <tr>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">CloudQuery on Using PostgreSQL for Cloud Asset...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/cloudquery-on-u...</td>
  </tr>
  <tr>
    <td class="tg-0pky">2</td>
    <td class="tg-0pky">How a Data Scientist Is Building a Time-Series...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-a-data-scie...</td>
  </tr>
  <tr>
    <td class="tg-0pky">3</td>
    <td class="tg-0pky">How Conserv Safeguards History: Building an En...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-conserv-saf...</td>
  </tr>
  <tr>
    <td class="tg-0pky">4</td>
    <td class="tg-0pky">How Messari Uses Data to Open the Cryptoeconom...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-messari-use...</td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<h3 id="11-calculate-the-cost-of-embedding-data">1.1 Calculate the cost of embedding data</h3><p>It's usually a good idea to estimate how much creating embeddings for your selected content will cost. Below, we define a few helper functions that calculate a cost estimate before we create any embeddings, so there are no surprises.</p><p>OpenAI charges per token for the embeddings it creates. The total cost for the blog posts we want to embed will be less than $0.01, thanks to OpenAI’s small text embedding model, <a href="https://openai.com/index/new-embedding-models-and-api-updates/"><u>text-embedding-3-small</u></a>. This model offers not only stronger performance but also a 5x cost reduction compared to its predecessor, <a href="https://openai.com/blog/new-and-improved-embedding-model?ref=timescale.com"><u>text-embedding-ada-002</u></a>.</p><pre><code class="language-Python"># Helper functions for estimating embedding cost

# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name="cl100k_base") -&gt; int:
    """Return the number of tokens in a text string."""
    if not string:
        return 0
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Helper function: calculate length of essay
def get_essay_length(essay):
    word_list = essay.split()
    num_words = len(word_list)
    return num_words

# Helper function: calculate cost of embedding num_tokens
# Assumes we're using the text-embedding-3-small model ($0.02 per 1M tokens)
# See https://openai.com/pricing
def get_embedding_cost(num_tokens):
    return num_tokens / 1_000_000 * 0.02

# Helper function: calculate total cost of embedding all content in the dataframe
def get_total_embeddings_cost():
    total_tokens = 0
    for i in range(len(df.index)):
        text = df['content'][i]
        token_len = num_tokens_from_string(text)
        total_tokens = total_tokens + token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost

</code></pre>
<pre><code class="language-Python"># quick check on total token amount for price estimation
total_cost = get_total_embeddings_cost()
print("estimated price to embed this content = $" + str(total_cost))

</code></pre>
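<p>As a quick sanity check on the estimate, the arithmetic is simple. Assuming text-embedding-3-small's rate of $0.02 per million tokens (the figure used in the helper above), even half a million tokens of blog content comes to about a cent. The token count below is illustrative, not our actual total:</p>

```python
# Pricing assumption: text-embedding-3-small at $0.02 per 1M tokens
price_per_million_tokens = 0.02
total_tokens = 500_000  # illustrative token count, not our actual total

cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"estimated cost = ${cost:.4f}")  # estimated cost = $0.0100
```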
<h3 id="12-create-smaller-chunks-of-content">1.2 Create smaller chunks of content</h3><p>The OpenAI API has a maximum token <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings"><u>limit</u></a> per embedding request: 8,191 tokens, to be specific. To stay under this limit, we'll break our text into smaller chunks. Generally, it's a best practice to “chunk” the documents you want to create embeddings for into pieces of a fixed token size.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png" class="kg-image" alt="A table with the performance eval of the OpenAI embedding models" loading="lazy" width="1748" height="546" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1600w, https://timescale.ghost.io/blog/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1748w" sizes="(min-width: 720px) 720px"></figure><p>The precise number of tokens to include in a chunk depends on your use case and your model’s context window—the number of input tokens it can handle in a prompt.</p><p>For our purposes, we'll aim for chunks of around 512 tokens each. Chunking text is a complex topic worthy of its own blog post; below, we'll illustrate a simple method that we found works well. If you want to read about other approaches, we recommend <a href="https://python.langchain.com/docs/how_to/#text-splitters"><u>this section</u></a> of the LangChain docs.</p><p><strong>Note:</strong> If you prefer to skip this step, you can use the provided file: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">blog_data_and_embeddings.csv</a>, which contains the data and embeddings that you'll generate in this step.</p><p>The code below creates a new list of chunked blog content while retaining the metadata associated with each chunk, such as the title and URL of the blog post the text came from.</p><pre><code class="language-Python">
# Create new list with small content chunks to not hit max token limits
# Note: the maximum number of tokens for a single request is 8191
# https://platform.openai.com/docs/guides/embeddings/embedding-models

# list for chunked content and embeddings
new_list = []
# Split up the text into token sizes of around 512 tokens
for i in range(len(df.index)):
    text = df['content'][i]
    token_len = num_tokens_from_string(text)
    if token_len &lt;= 512:
        new_list.append([df['title'][i], df['content'][i], df['url'][i], token_len])
    else:
        # add content to the new list in chunks
        start = 0
        ideal_token_size = 512
        # 1 token ~ 3/4 of a word
        ideal_size = int(ideal_token_size // (4/3))
        end = ideal_size
        # split the text into words on whitespace
        words = text.split()

        # drop any stray whitespace-only entries
        words = [x for x in words if x != ' ']

        total_words = len(words)

        # calculate the number of chunks (rounding up)
        chunks = total_words // ideal_size
        if total_words % ideal_size != 0:
            chunks += 1
        
        new_content = []
        for j in range(chunks):
            if end &gt; total_words:
                end = total_words
            new_content = words[start:end]
            new_content_string = ' '.join(new_content)
            new_content_token_len = num_tokens_from_string(new_content_string)
            if new_content_token_len &gt; 0:
                new_list.append([df['title'][i], new_content_string, df['url'][i], new_content_token_len])
            start += ideal_size
            end += ideal_size

</code></pre>
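<p>The word-count heuristic above is simple but approximate (some chunks land slightly over or under the 512-token target). As one alternative, here is a sketch (our own addition, not part of the original notebook) that splits on exact token boundaries by encoding the text first. In this notebook the <code>encoding</code> argument would be <code>tiktoken.get_encoding("cl100k_base")</code>, but any object with <code>encode</code>/<code>decode</code> methods works:</p>

```python
def chunk_by_tokens(text, encoding, chunk_size=512):
    """Split text into consecutive chunks of at most chunk_size tokens each."""
    tokens = encoding.encode(text)
    # Decode each fixed-size slice of tokens back into a text chunk
    return [
        encoding.decode(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

<p>Token-exact splitting guarantees every chunk fits the embedding model's limit, at the cost of occasionally cutting mid-sentence.</p>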
<p>Now that our text is split into appropriately sized chunks, we can create embeddings for each one using the OpenAI API.</p><p>We’ll use this helper function to create an embedding for a piece of text:</p><pre><code class="language-Python">openai_client = openai.OpenAI()

# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input = text.replace("\n"," ")
    )
    return response.data[0].embedding</code></pre>
<p><br>And then create embeddings for each chunk of content:</p><pre><code class="language-Python"># Create embeddings for each piece of content
for i in range(len(new_list)):
    text = new_list[i][1]
    embedding = get_embeddings(text)
    new_list[i].append(embedding)

# Create a new dataframe from the list
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens', 'embeddings'])
df_new.head()

</code></pre>
<p>The new data frame should look like this:</p>
<!--kg-card-begin: html-->
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Title</span></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Content</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">URL</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Tokens</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Embeddings</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">0</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">501</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.021440856158733368, 0.02200360782444477, -0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">1</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">capture weather and environmental data. In all...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">512</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.016165969893336296, 0.011341351084411144, 0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">2</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">command in their database migration:SELECT cre...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">374</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.022517921403050423, -0.0019158280920237303,...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">3</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">519</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.009028822183609009, -0.005185891408473253, ...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">4</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Architecture with CloudQuery SDK- Writing plug...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">511</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.02050386555492878, 0.010169642977416515, 0....</span></td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><br>As an optional but recommended step, you can save the blog content along with its embeddings to a CSV file, so that you don't have to recreate the embeddings if you want to reuse them in another project.</p><pre><code class="language-Python"># Save the dataframe with embeddings as a CSV file
df_new.to_csv('blog_data_and_embeddings.csv', index=False)
</code></pre>
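<p>One caveat for later: when you read that CSV back with pandas, the embeddings column comes back as strings rather than lists of floats, so parse it before reuse. A minimal, self-contained sketch of the round trip (using an in-memory buffer and toy data in place of the real file):</p>

```python
import ast
import io

import pandas as pd

# Tiny stand-in for df_new: one chunk with a two-dimensional "embedding"
df_demo = pd.DataFrame(
    [["Example title", "Example content", "https://example.com", 3, [0.1, 0.2]]],
    columns=["title", "content", "url", "tokens", "embeddings"],
)

# Round-trip through CSV; a real path like 'blog_data_and_embeddings.csv'
# behaves the same way as this in-memory buffer
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)

df_loaded = pd.read_csv(buf)
# The column comes back as the string "[0.1, 0.2]"; parse it back into floats
df_loaded["embeddings"] = df_loaded["embeddings"].apply(ast.literal_eval)
```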
<h2 id="pro-tip-automating-embedding-creation-with-pgai-vectorizer">Pro Tip: Automating Embedding Creation with pgai Vectorizer</h2><p>In the section above, we showed how to manually create and manage embeddings in your own data pipeline: chunking content, calling the OpenAI API, and storing the results. While this approach helps you understand the fundamentals, in production you may want to automate the process completely. Let’s look at how <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>pgai Vectorizer</u></a> can handle this entire pipeline for you.</p><p>Managing embeddings in production involves several challenges: keeping embeddings in sync with changing content, handling API failures, and chunking text optimally.</p><p><a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> Vectorizer automates this entire process directly in PostgreSQL, much as PostgreSQL automatically <a href="https://docs.timescale.com/use-timescale/latest/schema-management/indexing?ref=timescale.com"><u>maintains indexes</u></a> for your tables.</p><h3 id="setting-up-pgai-vectorizer">Setting Up pgai Vectorizer</h3><p>The setup process differs depending on whether you’re using Tiger Cloud (formerly Timescale Cloud) or hosting PostgreSQL yourself.</p><p><strong>On Tiger Cloud</strong></p><pre><code class="language-sql">-- 1. Store your OpenAI API key securely in Timescale Cloud
-- 2. Navigate to Project Settings &gt; AI Model API Keys in the Timescale Console
-- 3. The key is stored securely and not in your database
-- 4. Create the extensions
CREATE EXTENSION IF NOT EXISTS ai;</code></pre><p><strong>For self-hosted PostgreSQL</strong></p><pre><code class="language-bash">export OPENAI_API_KEY="your-api-key-here"

# Start the vectorizer worker
vectorizer-worker --connection="postgres://user:password@host:port/dbname"</code></pre><h3 id="creating-your-first-vectorizer">Creating Your First Vectorizer</h3><p>Instead of manually creating embeddings using Python, you can define a <em>vectorizer</em> that automatically generates and maintains embeddings for your content:</p><pre><code class="language-sql">SELECT ai.create_vectorizer( 
   'blog_posts'::regclass,
    destination =&gt; 'blog_embeddings',
    embedding =&gt; ai.embedding_openai('text-embedding-3-small', 768),
    chunking =&gt; ai.chunking_recursive_character_text_splitter('content'),
    -- Pro tip: Add blog title as context to each chunk
    formatting =&gt; ai.formatting_python_template('$title: $chunk')
);</code></pre><p>This single SQL command:</p><ol><li>Automatically chunks your blog content</li><li>Creates embeddings for each chunk using OpenAI's API</li><li>Maintains embeddings as your content changes</li><li>Creates a view that joins your content with its embeddings</li></ol><h3 id="searching-with-vectorizer">Searching with Vectorizer</h3><p>You can then search your content the same way as before:</p><pre><code class="language-sql">SELECT 
   chunk,
   embedding &lt;=&gt; ai.openai_embed('text-embedding-3-small', 'How is Timescale used in IoT?') as distance
FROM blog_embeddings
ORDER BY distance
LIMIT 3;</code></pre><p>Vectorizer runs automatically every five minutes on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Tiger Cloud</u></a>, handling retries and keeping your embeddings up to date. For more details on setup and advanced features like <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md#monitor-a-vectorizer"><u>monitoring the Vectorizer</u></a>, see our pgai Vectorizer <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>documentation</u></a>.</p><h3 id="further-reading-on-rag">Further Reading on RAG</h3><p>The accuracy and cost of your RAG application depend heavily on implementation choices, from embedding model selection to chunking strategy.</p><p>Here are more blog posts to help you build effective RAG applications with PostgreSQL:</p><ol><li><a href="https://timescale.ghost.io/blog/vector-databases-are-the-wrong-abstraction?ref=timescale.com"><u>Vector Databases Are the Wrong Abstraction</u></a> – Learn why general-purpose databases with vector extensions like <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> often provide better solutions than specialized vector databases</li><li><a href="https://timescale.ghost.io/blog/which-rag-chunking-and-formatting-strategy-is-best?ref=timescale.com"><u>Which RAG Chunking and Formatting Strategy Is Best?</u></a> – Explore different approaches to chunking and formatting your content for optimal retrieval-augmented generation (RAG) performance</li><li><a href="https://www.tigerdata.com/blog/which-openai-embedding-model-is-best" rel="noreferrer"><u>Which OpenAI Embedding Model Is Best?</u></a> – Compare OpenAI's embedding models to choose the right one for your use case</li></ol><h2 id="part-2-store-embeddings-in-a-postgresql-vector-database-using-pgvector">Part 2: Store Embeddings in a PostgreSQL Vector Database Using Pgvector</h2><p>Now that we 
have created embedding vectors for our blog content, the next step is to store them in a vector database that supports fast similarity search across many vectors.</p><h3 id="21-create-a-postgresql-database-and-install-pgvector">2.1 Create a PostgreSQL database and <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">install pgvector</a></h3><p>First, we’ll create a PostgreSQL database. You can <a href="https://docs.timescale.com/getting-started/latest/services/" rel="noreferrer">create a cloud PostgreSQL database</a> in minutes for free on <a href="https://console.cloud.timescale.com/signup">Tiger Cloud</a> or use a local PostgreSQL database for this step. </p><p>Once you’ve created your PostgreSQL database, export your connection string as an environment variable, and just like the OpenAI API key, we’ll read it into our Python program from the environment:</p><pre><code class="language-Python"># Timescale database connection string
# Found under "Service URL" of the credential cheat-sheet or "Connection Info" in the Timescale console
# In terminal, run: export TIMESCALE_CONNECTION_STRING=postgres://&lt;fill in here&gt;

connection_string = os.environ['TIMESCALE_CONNECTION_STRING']

</code></pre>
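<p>A small defensive touch (our own addition, not from the original notebook): wrapping the lookup so a missing variable fails with a helpful hint instead of a bare <code>KeyError</code>:</p>

```python
import os

def get_connection_string(var_name="TIMESCALE_CONNECTION_STRING"):
    """Read the database URL from the environment, failing with a clear hint."""
    value = os.environ.get(var_name)
    if value is None:
        raise RuntimeError(
            f"{var_name} is not set; run: "
            f"export {var_name}=postgres://user:password@host:port/dbname"
        )
    return value
```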
<p>We then connect to our database using the popular <a href="https://pypi.org/project/psycopg2/?ref=timescale.com"><u>psycopg2</u></a> Python library and install the pgvector and <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#installation"><u>pgvectorscale</u></a> extensions (the latter provides powerful filtering and indexing capabilities) as follows:</p><pre><code class="language-Python"># Connect to PostgreSQL database in Timescale using connection string
conn = psycopg2.connect(connection_string)
cur = conn.cursor()

# Install pgvector
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()

# Install pgvectorscale
cur.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
conn.commit()
</code></pre>
<h3 id="22-connect-to-and-configure-your-vector-database">2.2 Connect to and configure your vector database</h3><p>Once we’ve installed pgvector, we use the <a href="https://github.com/pgvector/pgvector-python#psycopg-2">register_vector()</a> command to register the vector type with our connection:</p><pre><code class="language-Python"># Register the vector type with psycopg2
register_vector(conn)
</code></pre>
<p>Once we’ve connected to the database, let’s create a table that we’ll use to store embeddings along with metadata. Our table will look as follows:<br></p>
<!--kg-card-begin: html-->
<table><tbody><tr><td>id</td><td>title</td><td>url</td><td>content</td><td>tokens</td><td>embedding</td></tr></tbody></table>
<!--kg-card-end: html-->
<p></p><ul><li><code>id</code> represents the unique ID of each <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a> in the table.</li><li><code>title</code> is the blog title from which the content associated with the embedding is taken.</li><li><code>url</code> is the blog URL from which the content associated with the embedding is taken.</li><li><code>content</code> is the actual blog content associated with the embedding.</li><li><code>tokens</code> is the number of tokens the embedding represents.</li><li><code>embedding</code> is the vector representation of the content.<br></li></ul><p>One advantage of using PostgreSQL as a vector database is that you can easily store metadata and embedding vectors in the same database, which is helpful for supplying the user with relevant information related to the response they receive, like links to read more or specific parts of a blog post that are relevant to them.</p><pre><code class="language-Python"># Create table to store embeddings and metadata
table_create_command = """
CREATE TABLE embeddings (
            id bigserial primary key, 
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
cur.close()
conn.commit()

</code></pre>
<h3 id="23-ingest-and-store-vector-data-into-postgresql-using-pgvector">2.3 Ingest and <a href="https://www.tigerdata.com/learn/vector-store-vs-vector-database" rel="noreferrer">store vector</a> data into PostgreSQL using pgvector</h3><p>Now that we’ve created the database and the table to house the embeddings and metadata, the final step is to insert the embedding vectors into the database. </p><p>For this step, it’s a best practice to batch insert the embeddings rather than insert them one by one.<br></p><pre><code class="language-Python"># Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()

</code></pre>
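<p>For very large DataFrames you may not want to build and send one giant tuple list in a single round trip. <code>execute_values</code> accepts a <code>page_size</code> argument, and you can also pre-chunk the rows yourself with a plain generator. A minimal sketch using only the standard library (committing per chunk is a design choice here, not a requirement):</p>

```python
def batched(rows, size):
    """Yield successive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Hypothetical usage with the cursor and data_list from above:
# for chunk in batched(data_list, 1000):
#     execute_values(cur,
#         "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s",
#         chunk)
#     conn.commit()
```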
<p>Let’s sanity check by running some simple queries against our newly inserted data:</p><pre><code class="language-Python">cur.execute("SELECT COUNT(*) as cnt FROM embeddings;")
num_records = cur.fetchone()[0]
print("Number of vector records in table: ", num_records,"\n")
# Correct output should be 129

</code></pre>
<pre><code class="language-Python"># print the first record in the table, for sanity-checking
cur.execute("SELECT * FROM embeddings LIMIT 1;")
records = cur.fetchall()
print("First record in table: ", records)
</code></pre>
<h3 id="24-index-your-data-for-faster-retrieval">2.4 Index your data for faster retrieval</h3><p>In this example, we only have 129 embedding vectors, so searching through all of them is blazingly fast. But for larger datasets, you need to create indexes to speed up searching for similar embeddings, so we include the code to build the index for illustrative purposes. </p><p>While pgvector&nbsp;supports the <a href="https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work" rel="noreferrer"><u>IVFFLAT</u></a> and <a href="https://www.tigerdata.com/learn/vector-database-basics-hnsw" rel="noreferrer"><u>HNSW</u></a> index types for approximate nearest neighbor (ANN) search, <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> offers a more cost-efficient and powerful index type for pgvector data: <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>StreamingDiskANN</u></a>, which we use here. </p><p>You always want to build this index <strong>after</strong> you have inserted the data, as the index needs to discover clusters in your data to be effective, and it does this only when first building the index. </p><p>The StreamingDiskANN index has tunable parameters depending on your goal, whether it is changing indexing operations or querying operations. In our case, we use the default values of the parameters. You can read more about <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#tuning"><u>tuning here</u></a>.</p><pre><code class="language-Python"># Create an index on the data for faster retrieval
cur.execute('CREATE INDEX embedding_idx ON embeddings USING diskann (embedding);')
conn.commit()
</code></pre>
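<p>If the defaults don’t hit your recall/latency targets, build-time options can be passed in a <code>WITH</code> clause on the index. The parameter names below (<code>num_neighbors</code>, <code>search_list_size</code>) are taken from the pgvectorscale README and may vary by version, so treat this as an illustrative sketch rather than a verified configuration:</p>

```python
def diskann_index_sql(table, column, num_neighbors=50, search_list_size=100):
    """Build a CREATE INDEX statement with explicit StreamingDiskANN options."""
    return (
        f"CREATE INDEX {table}_{column}_diskann_idx ON {table} "
        f"USING diskann ({column}) "
        f"WITH (num_neighbors = {num_neighbors}, search_list_size = {search_list_size});"
    )

# cur.execute(diskann_index_sql("embeddings", "embedding"))
# conn.commit()
```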
<h2 id="part-3-nearest-neighbor-search-using-pgvector">Part 3: Nearest Neighbor Search Using pgvector</h2><p>Given a user question, we’ll perform the following steps to use information stored in the vector database to answer their question using Retrieval Augmented Generation:</p><ol><li>Create an embedding vector for the user question.</li><li>Use pgvector to perform a vector similarity search and retrieve the <code>k</code> nearest neighbors to the question embedding from our embedding vectors representing the blog content. In our example, we’ll use k=3, finding the three most similar embedding vectors and associated content.</li><li>Supply the content retrieved from the database as additional context to the model and ask it to perform a completion task to answer the user question.</li></ol><h3 id="31-define-a-question-you-want-to-answer">3.1 Define a question you want to answer</h3><p>First, we’ll define a sample question that a user might want to answer about the blog posts stored in the database.</p><pre><code class="language-Python"># Question about Timescale we want the model to answer
input = "How is Timescale used in IoT?"
</code></pre>
<p>Since TimescaleDB is <a href="https://timescale.ghost.io/blog/visualizing-iot-data-at-scale-with-hopara-and-timescaledb/">popular for IoT sensor data</a>, a user might want to learn specifics about how they can leverage it for that use case.</p><h3 id="32-find-the-most-relevant-content-in-the-database">3.2 Find the most relevant content in the database</h3><p>Here’s the function we use to find the three nearest neighbors to the user question. Note it uses pgvector’s <code>&lt;=&gt;</code> operator, which computes the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a> (one minus the <a href="https://www.tigerdata.com/learn/understanding-cosine-similarity" rel="noreferrer">cosine similarity</a>) between two embedding vectors.</p><pre><code class="language-Python"># Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs

</code></pre>
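<p>To make the ordering concrete: cosine distance equals one minus cosine similarity, so the nearest neighbors are the rows with the <em>smallest</em> distance value. The arithmetic behind pgvector’s <code>&lt;=&gt;</code> operator can be reproduced in NumPy:</p>

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (what pgvector's operator orders by)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

print(cosine_distance([1, 0], [2, 0]))  # identical direction: 0.0
print(cosine_distance([1, 0], [0, 1]))  # orthogonal: 1.0
```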
<h3 id="33-define-helper-functions-to-query-openai">3.3 Define helper functions to query OpenAI</h3><p>We define a helper function to get a completion response from an OpenAI model, and we reuse the previously defined helper function, <code>get_embeddings</code>, to create an embedding for the user question. We use GPT-4o, but you can use any other model from OpenAI.</p><p>We also specify a few parameters you can adjust to your liking: <code>max_tokens</code>, which caps the length of the model response, and <code>temperature</code>, which controls the randomness of the model’s output:</p><pre><code class="language-Python"># Helper function: get text completion from OpenAI API
# We default to the gpt-4o model, but you can pass in any OpenAI chat model
def get_completion_from_messages(messages, model="gpt-4o", temperature=0, max_tokens=1000):
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message.content
</code></pre>
<h3 id="33-putting-it-all-together">3.4 Putting it all together</h3><p>We’ll define a function to process the user input by retrieving the most similar documents from our database and passing the user input, along with the relevant retrieved context, to the OpenAI model to generate a completion response.</p><p>We also modify the system prompt to influence the tone of the model’s response.</p><p>We pass the content associated with the three embeddings most similar to the user input to the model using the assistant role. You can also append the additional context to the user message.<br></p><pre><code class="language-Python"># Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"

    #Step 1: Get documents related to the user input from database
    related_docs = get_top3_similar_docs(get_embeddings(user_input), conn)

    # Step 2: Get completion from OpenAI API
    # Set system message to help set appropriate tone and context for model
    system_message = f"""
    You are a friendly chatbot. \
    You can answer questions about timescaledb, its features and its use cases. \
    You respond in a concise, technically credible tone. \
    """

    # Prepare messages to pass to model
    # We use a delimiter to help the model understand where the user_input starts and ends
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": f"Relevant Timescale case studies information: \n {related_docs[0][0]} \n {related_docs[1][0]} \n {related_docs[2][0]}"}
    ]

    final_response = get_completion_from_messages(messages)
    return final_response
</code></pre>
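<p>If you want to experiment with where the retrieved context goes (appended to the user message instead of an assistant turn, as mentioned above), it helps to isolate the message-assembly step. A hypothetical refactor; the helper name and shape are ours, not part of the OpenAI library:</p>

```python
def build_messages(system_message, user_input, docs, context_in_user=False):
    """Assemble chat messages, placing retrieved docs in either an assistant
    turn or at the end of the user turn."""
    delimiter = "`" * 3  # same triple-backtick delimiter as above
    context = "\n".join(docs)
    user_content = f"{delimiter}{user_input}{delimiter}"
    if context_in_user:
        return [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"{user_content}\nRelevant context:\n{context}"},
        ]
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": f"Relevant context:\n{context}"},
    ]
```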
<p>Let’s see an example of the model’s output to our original input question:</p><pre><code class="language-Python">input = "How is Timescale used in IoT?"
response = process_input_with_retrieval(input)
print(input)
print(response)
</code></pre>
<p>Here’s the model’s response:</p><pre><code>User input: How is Timescale used in IoT?

Model Response:

TimescaleDB is widely used in IoT applications due to its ability to efficiently handle time-series data, which is a common requirement in IoT systems. Here are some ways TimescaleDB is utilized in IoT:

1. **Data Ingestion and Storage**: IoT devices generate large volumes of time-stamped data. TimescaleDB's architecture is optimized for ingesting and storing this data efficiently, allowing for high write throughput and efficient storage through features like data compression.

2. **Real-time Analytics**: TimescaleDB supports real-time analytics, enabling IoT applications to process and analyze data as it arrives. This is crucial for applications that require immediate insights, such as monitoring systems or predictive maintenance.

3. **Historical Data Analysis**: With its support for continuous aggregates, TimescaleDB allows for efficient querying of historical data. This is useful for trend analysis, reporting, and long-term data storage, which are common in IoT use cases.

4. **Integration with IoT Platforms**: TimescaleDB can be integrated with various IoT platforms and tools, such as AWS IoT, MQTT, and Grafana, to manage, visualize, and analyze IoT data seamlessly.

5. **Scalability**: As IoT deployments grow, the volume of data can increase significantly. TimescaleDB's scalability features, such as partitioning and distributed hypertables, help manage this growth effectively.

Overall, TimescaleDB provides a robust solution for managing the unique challenges of IoT data, combining the benefits of a relational database with time-series capabilities.

</code></pre><p>We can also ask the model questions about specific documents in the database, in this case about specific TimescaleDB users who have spoken about their experience in a blog post:</p><pre><code class="language-Python">input_2 = "Tell me about Edeva and Hopara. How do they use Timescale?"
response_2 = process_input_with_retrieval(input_2)
print(input_2)
print(response_2)

</code></pre>
<p>Here’s the model output for the input question above; notice how it uses specific details from the blog posts.</p><pre><code>User input: Tell me about Edeva and Hopara. How do they use Timescale?


Model Response:

### Edeva

**Use Case:**
Edeva leverages TimescaleDB to manage and analyze data from their smart city systems, particularly focusing on their Actibump dynamic speed bump. The data collected includes:

1. **IoT Event Data:** Information such as vehicle speeds and environmental conditions.
2. **Alarm Data:** Alerts for sensor malfunctions or other issues.
3. **Status Data:** Self-checks and statistical data from IoT devices.
4. **Administrative Data:** Metadata about devices, such as configuration details.

**Key Features Utilized:**
- **Continuous Aggregations:** To speed up queries and make dashboards responsive.
- **Percentile Aggregations:** For calculating accurate percentile values without querying raw data.
- **SQL Compatibility:** Simplifies onboarding for developers familiar with SQL.

**Benefits:**
- **Performance:** Transitioned from sluggish to lightning-fast dashboards.
- **Ease of Use:** Developers could quickly adapt due to SQL familiarity.
- **Scalability:** Efficiently handles large datasets, such as hundreds of millions of records.

### Hopara

**Use Case:**
Hopara uses TimescaleDB to manage and visualize time-series data for their geospatial analytics platform. The platform integrates various data sources to provide insights into spatial and temporal trends.

**Key Features Utilized:**
- **Time-Series Data Management:** Efficiently stores and queries large volumes of time-series data.
- **Geospatial Capabilities:** Leverages PostgreSQL’s PostGIS extension for spatial queries.
- **Continuous Aggregations:** To pre-compute and speed up complex queries.

**Benefits:**
- **Scalability:** Handles large datasets with ease.
- **Performance:** Fast query execution for real-time analytics.
- **Integration:** Seamless integration with existing PostgreSQL tools and extensions.

Both Edeva and Hopara benefit from TimescaleDB’s ability to handle large volumes of time-series data efficiently, providing fast query performance and ease of use through SQL compatibility.
</code></pre><h2 id="conclusion">Conclusion</h2><p><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search" rel="noreferrer">Retrieval-augmented generation (RAG)</a> is a powerful method of building applications with LLMs that enables you to teach foundation models about things they were not originally trained on, like private documents or recently published information.</p><p>This project is an example of how to create, store, and perform similarity search on <a href="https://www.tigerdata.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noreferrer">OpenAI embeddings</a>. We used PostgreSQL + <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> + <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> as our vector database to efficiently store and query the embeddings, enabling precise and relevant responses.</p><h2 id="timescaledb-postgresql">TimescaleDB + PostgreSQL</h2><p>And if you’re looking for a production PostgreSQL database for your vector workloads, <a href="https://console.cloud.timescale.com/signup"><u>try Timescale</u></a>. 
It’s free for 30 days, no credit card required.</p><h3 id="further-reading">Further reading</h3><p>Here are more blog posts about RAG with PostgreSQL and different tools:</p><ul><li><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search/"><u>RAG Is More Than Just Vector Search</u></a></li><li><a href="https://timescale.com/blog/retrieval-augmented-generation-with-claude-sonnet-3-5-and-pgvector/"><u>Retrieval-Augmented Generation With Claude Sonnet 3.5 &amp; Pgvector</u></a></li><li><a href="https://www.timescale.com/blog/build-a-fully-local-rag-app-with-postgresql-mistral-and-ollama/"><u>Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama</u></a></li><li><a href="https://www.timescale.com/blog/building-an-ai-image-gallery-advanced-rag-with-pgvector-and-claude-sonnet-3-5/"><u>Building an AI Image Gallery: Advanced RAG With Pgvector and Claude Sonnet 3.5</u></a></li></ul><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions]]></title>
            <description><![CDATA[There is a better way to explore and analyze time-series data with PostgreSQL: hyperfunctions. Learn how to ease your workload with faster and simpler data analysis using TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Hyperfunctions]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[JF Joly]]></dc:creator>
            <pubDate>Thu, 23 Jun 2022 13:02:54 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/06/marc-olivier-jodoin-NqOInJ-ttqM-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<h2 id="why-the-right-tools-matter-when-analyzing-time-series-data">Why the Right Tools Matter When Analyzing Time-Series Data</h2><p><strong>SQL is the lingua franca for analytics. </strong>As data proliferates, we need to find new ways to store, explore, and analyze it. <a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">We believe SQL is the best language for data analysis</a>. We’ve championed the benefits of SQL for several years, even when many were swapping it for custom domain-specific languages. Full SQL support was one of the <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">key reasons</a> we chose to build TimescaleDB on top of PostgreSQL, the <a href="https://survey.stackoverflow.co/2022/#most-popular-technologies-database-prof">most loved database among developers</a>, rather than creating a custom query language. And we were right—SQL is making a comeback (although it never really went away) and has become the universal language for data analysis, with many NoSQL databases adding SQL interfaces to keep up.</p><p>In addition, most developers are familiar with SQL, along with most data scientists, data analysts, and other professionals who work with data. Whether you've taken classes at university, done an online course, or attended a boot camp, chances are that you probably have learned a bit of SQL along the way. So you and your fellow developers already know it, making it easier for teams to onboard new members and quickly extract value from the data. 
With a proprietary language, learning the language is in itself a barrier to using the data—you’ll have to ask another team to write the queries or rely on a separate data lake.</p><p><a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/"><strong>Time-series data</strong></a><strong> is ubiquitous. </strong>At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.</p><p>And time-series data comes at you fast, sometimes generating millions of data points per second. Because of the sheer volume and rate of information, time-series data can be complex to query and analyze, even in SQL.</p><p><strong>TimescaleDB hyperfunctions make it easier to manipulate and analyze time-series datasets with fewer lines of SQL code</strong>. Hyperfunctions are purpose-built for the most common and difficult time-series and analytical queries developers write today in SQL. Using hyperfunctions makes you more productive when querying time-series data, which means you can spend less time creating reports, dashboards, and visualizations involving <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, and spend more time acting on the insights that your work unearths!</p><h2 id="handling-time-series-data-meet-hyperfunctions">Handling Time-Series Data: Meet Hyperfunctions</h2><p>There are over 70 different TimescaleDB hyperfunctions ready to use today. 
Here are some of the most popular ones and how they can help you handle your time-series data:<br></p><ul><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#solved-how-to-query-arbitrary-time-intervals-with-date_trunc"><strong>Time-based analysis</strong></a><strong>:</strong> <code>time_bucket()</code> makes time-based analysis simpler and easier by enabling you to analyze data over arbitrary time intervals using succinct queries.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#first-and-last-"><strong><code>first()</code> and <code>last()</code></strong></a> allow you to get the value of one column as ordered by another (2x faster in TimescaleDB 2.7!).</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#simpler-time-weighted-averages"><strong>Time-weighted averages</strong></a><strong>:</strong> <code>time_weight()</code> and related hyperfunctions for working with time-weighted averages offer a more elegant way to get an unbiased average when working with irregularly sampled data.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#enhanced-query-readability-and-maintenance-with-function-pipelines-"><strong>Function pipelines</strong></a> enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL (currently experimental).</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#better-data-summaries-using-percentile-approximation"><strong>Percentile approximation</strong></a> brings percentile analysis to more workflows, enabling you to understand the distribution of your data 
efficiently (e.g., 10th percentile, median/50th percentile, 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets. When used with <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">continuous aggregates</a>, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#easier-frequency-analysis-with-frequency-aggregates"><strong>Frequency analysis</strong></a><strong>:</strong> <code>freq_agg()</code> and related frequency analysis hyperfunctions find the most common elements in a set of vastly more varied values more efficiently than a brute-force calculation.</li><li><a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/"><strong>Histogram</strong></a> shows the data distribution and can offer a better understanding of the segments compared to an average (<a href="https://statisticsbyjim.com/basics/histograms/">more on histograms</a>).</li><li><a href="https://timescale.ghost.io/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/"><strong>Downsampling</strong></a><strong>: </strong>ASAP smoothing smooths datasets to highlight the most important features when graphed. Largest Triangle Three Buckets Downsampling or <code>lttb()</code> reduces the number of elements in a dataset while retaining important features when graphed.
<a href="https://timescale.ghost.io/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/">See how to apply our downsampling hyperfunctions in Grafana.</a></li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#more-memory-efficient-count-distinct-queries"><strong>Memory efficient COUNT DISTINCTs</strong></a>: HyperLogLog is a probabilistic cardinality estimator that uses significantly less memory than the equivalent COUNT DISTINCT query. It is ideal for use in a continuous aggregate for large datasets.</li></ul><p>We created new SQL functions for each of these time-series analysis and manipulation capabilities. This contrasts with other efforts to improve the developer experience by introducing new SQL syntax. While introducing new syntax with new keywords and constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe it leads to a worse experience for the end-user. </p><p>New SQL syntax means existing drivers, libraries, and tools may no longer work. That can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax or require time-consuming modifications. On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool. 
</p><p>We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, and ensure all their drivers and interfaces still work as expected!<br><br>We will now dive into each hyperfunction category that we mentioned and give examples of when, why, and how to use them, plus resources to continue your learning.</p><p>TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. <a href="https://console.cloud.timescale.com/">Get started with a free Timescale trial</a>—no credit card required. Or <a href="https://docs.timescale.com/install/latest/">download for free</a> with TimescaleDB self-managed.</p><p>If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/#analyze-data-using-timescaledb-continuous-aggregates-and-hyperfunctions">start our tutorial</a>, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).  </p><p>Can’t find the function you need?<strong> </strong>Open an issue on our <a href="https://github.com/timescale/timescaledb-toolkit/issues">GitHub project</a> or contact us on <a href="https://timescaledb.slack.com/">Slack</a> or via the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We love to work with our users to simplify SQL!</p><h2 id="solved-how-to-query-arbitrary-time-intervals-with-datetrunc">Solved: How to Query Arbitrary Time-Intervals With date_trunc</h2><p>When using PostgreSQL, the <a href="https://www.postgresql.org/docs/current/functions-datetime.html"><code>date_trunc</code></a><a href="https://www.postgresql.org/docs/current/functions-datetime.html"> function</a> can be useful when you want to aggregate information over an interval of time. 
<code>date_trunc</code> truncates a <code>TIMESTAMP</code> or an <code>INTERVAL</code> value based on a specified date part (e.g., hour, week, or month) and returns the truncated timestamp or interval. For example, <code>date_trunc</code> can aggregate by one second, one hour, one day, or one week. However, you often want to see aggregates by the time intervals that matter most to your use case, which may be intervals like 30 seconds, 5 minutes, 12 hours, etc. This can get pretty complicated in SQL; just look at the query below, which analyzes taxi ride activity in five-minute time intervals:</p><p><strong>Regular PostgreSQL: Taxi rides taken every five minutes</strong></p><pre><code class="language-SQL">SELECT
  EXTRACT(hour from pickup_datetime) as hours,
  trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins,
  COUNT(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY hours, five_mins;
</code></pre>
<p><strong>The <code>time_bucket()</code> hyperfunction makes it easy to query your data in whatever time interval is most relevant to your analysis use case</strong>. <code>time_bucket()</code> enables you to aggregate data by arbitrary time intervals (e.g., 10 seconds, 5 minutes, 6 hours, etc.), and gives you flexible groupings and offsets, instead of just second, minute, hour, and so on.</p><p>In addition to allowing more flexible time-series queries, <code>time_bucket()</code> also allows you to write these queries in a simpler way. Just look at how much simpler the query from the example above is to write and understand when using the <code>time_bucket()</code> hyperfunction:</p><p><strong>TimescaleDB hyperfunctions: Taxi rides taken every five minutes</strong></p><pre><code class="language-SQL">-- How many rides took place every 5 minutes for the first day of 2016?
SELECT time_bucket('5 minute', pickup_datetime) AS five_min, count(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY five_min
ORDER BY five_min;
</code></pre>
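<p>The optional third argument of <code>time_bucket()</code> lets you shift bucket boundaries by an offset. The query below is a sketch (against the same <code>rides</code> table) that aligns five-minute buckets to start 2.5 minutes past the default boundary:</p><pre><code class="language-SQL">-- five-minute buckets shifted by a 2.5-minute offset
SELECT time_bucket('5 minutes', pickup_datetime, '2.5 minutes'::interval) AS five_min,
  count(*)
FROM rides
GROUP BY five_min
ORDER BY five_min;
</code></pre>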
<p>If you’d like even more flexibility when aggregating your data, you can test out the  hyperfunction <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket</code></a>, which is an updated version of the original <code>time_bucket()</code> hyperfunction. <code>time_bucket</code> enables you to bucket your data by years and months, in addition to second, minute, and hour time intervals. This allows you to easily do monthly cohort analysis or other multiple-month-based reports in SQL.</p><pre><code class="language-SQL">SELECT time_bucket('3 month', date '2021-08-01');
 time_bucket
----------------
 2021-07-01
(1 row)
</code></pre>
<p><code>time_bucket</code> also features custom timezone support, which enables you to write queries like the one below, which illustrates using it to bucket data in the Europe/Moscow region:</p><pre><code class="language-SQL">-- note that timestamptz is displayed differently depending on the session parameters
SET TIME ZONE 'Europe/Moscow';

SELECT time_bucket('1 month', timestamptz '2001-02-03 12:34:56 MSK', timezone =&gt; 'Europe/Moscow');
     time_bucket
------------------------
 2001-02-01 00:00:00+03
</code></pre>
<p>Missing data, or gaps, are a common occurrence when capturing hundreds or thousands of time-series readings per second or minute. Gaps can appear because of irregular sampling intervals or because you have experienced an outage of some sort. </p><p>The <code>time_bucket_gapfill()</code> hyperfunction enables you to create additional rows of data in any gaps, ensuring that the returned rows are in chronological order and contiguous. To learn more about gappy data, read our blog <a href="https://timescale.ghost.io/blog/sql-functions-for-time-series-analysis/">Mind the Gap: Using SQL Functions for Time-Series Analysis</a>.</p><p>Here’s an example of <code>time_bucket_gapfill()</code> in action, where we find the daily average temperature for a certain device and use the <code>locf()</code> function to carry the last observation forward in case we have gaps in our data:</p><pre><code class="language-SQL">SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  locf(avg(temperature))
FROM metrics
WHERE time &gt; now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;

           day          | device_id | value | locf
------------------------+-----------+-------+------
 2019-01-10 01:00:00+01 |         1 |       |
 2019-01-11 01:00:00+01 |         1 |   5.0 |  5.0
 2019-01-12 01:00:00+01 |         1 |       |  5.0
 2019-01-13 01:00:00+01 |         1 |   7.0 |  7.0
 2019-01-14 01:00:00+01 |         1 |       |  7.0
 2019-01-15 01:00:00+01 |         1 |   8.0 |  8.0
 2019-01-16 01:00:00+01 |         1 |   9.0 |  9.0
(7 rows)
</code></pre>
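<p>If carrying the last observation forward isn’t the right fit for your data, gaps can instead be filled by linear interpolation between the surrounding readings. Here’s a sketch using the <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/"><code>interpolate()</code></a> function on the same <code>metrics</code> table:</p><pre><code class="language-SQL">SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  interpolate(avg(temperature))
FROM metrics
WHERE time &gt; now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;
</code></pre>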
<p>The last observation carried forward or <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/"><code>locf(</code></a><code>)</code> function allows you to carry forward the last seen value in an aggregation group. You can only use it in an aggregation query with <code>time_bucket_gapfill</code>.</p><p>To learn more about using the <code>time_bucket</code> family of hyperfunctions, read the <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">docs</a>, and get started with our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/">tutorial</a>, which uses <code>time_bucket()</code> to analyze a real-world IoT dataset.</p><h2 id="simpler-time-weighted-averages">Simpler Time-Weighted Averages</h2><p>If you’re in a situation where you don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in the following cases:</p><ul><li>Industrial IoT, where teams “compress” data by only sending points when the value changes.</li><li>Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations.</li><li>Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered).</li></ul><p>Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.<br><br>To illustrate the value of a hyperfunction to find time-weighted averages, consider the following example of a simple table modeling freezer temperature:</p><pre><code class="language-SQL">CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);
</code></pre>
<p>And some irregularly sampled time-series data representing the freezer temperature:</p><pre><code class="language-SQL">INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);

</code></pre>
<p>Calculating the time-weighted average temperature of the freezer using regular SQL functions would look something like this:</p><p><strong>Time-weighted averages using regular SQL</strong></p><pre><code class="language-SQL">WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<p>But, with the TimescaleDB <code>time_weight()</code> hyperfunction, we reduce this query, which is tedious to write and confusing to read, to a much simpler five-line query:<br></p><pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;

freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
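<p>Because <code>time_weight()</code> is an aggregate, it also composes naturally with <code>time_bucket()</code>. For example, here’s a sketch that computes a time-weighted average per 10-minute bucket for each freezer:</p><pre><code class="language-SQL">SELECT freezer_id,
	time_bucket('10 minutes', ts) AS bucket,
	average(time_weight('Linear', ts, temperature)) AS time_weighted_average
FROM freezer_temps
GROUP BY freezer_id, bucket
ORDER BY bucket;
</code></pre>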
<p>To learn more about using time-weighted average hyperfunctions, read the<a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/"> docs</a> and see our explainer blog post: <a href="https://timescale.ghost.io/blog/what-time-weighted-averages-are-and-why-you-should-care/">What time-weighted averages are and why you should care</a>.</p><h2 id="better-data-summaries-using-percentile-approximation">Better Data Summaries Using Percentile Approximation</h2><p>Many developers choose to use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large time-series datasets, both in computational resources and time.</p><p>As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls of calculating exact percentiles. </p><p>TimescaleDB’s<strong> percentile approximation hyperfunctions</strong> enable you to understand your data distribution efficiently (e.g., the 10th percentile, the 50th percentile or median, the 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets.</p><p>With relatively large datasets, you can often accept some accuracy trade-offs to avoid running into issues of high memory footprint and network costs while enabling percentiles to be computed more efficiently in parallel and used on streaming data. (In this post, you can learn more about the <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">design decisions and trade-offs made in TimescaleDB’s percentile approximation hyperfunctions</a>.)</p><p>TimescaleDB has a whole family of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunctions</a>. 
The simplest way to call them is to use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/">percentile_agg aggregate</a> along with the<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"> approx_percentile accessor</a>. For example, here’s how we might calculate the 10th, 50th (median), and 90th percentiles of the response time of a particular API:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(response_time)) as p10, 
    approx_percentile(0.5, percentile_agg(response_time)) as p50, 
    approx_percentile(0.9, percentile_agg(response_time)) as p90 
FROM responses;
</code></pre>
<p>Hyperfunctions for percentile approximation can also be used in <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">TimescaleDB’s continuous aggregates</a>, which make aggregate queries on very large datasets run faster. Continuous aggregates continuously and incrementally store the results of an aggregation query in the background. So, when you run the query, only the changed data needs to be computed, not the entire dataset.</p><p>That is a huge advantage compared to exact percentiles because you can now do things like baselining and alerting on longer periods without recalculating from scratch every time!</p><p>For example, here’s how you can use continuous aggregates to identify recent outliers and investigate potential problems. First, we create a one-hour aggregation from the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> <code>responses</code>:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);
SELECT create_hypertable('responses', 'ts');
</code></pre>
<pre><code class="language-SQL">CREATE MATERIALIZED VIEW responses_1h_agg
WITH (timescaledb.continuous)
AS SELECT 
    time_bucket('1 hour'::interval, ts) as bucket,
    percentile_agg(response_time)
FROM responses
GROUP BY time_bucket('1 hour'::interval, ts);
</code></pre>
<p>To find outliers, we can select data points from the last 30 seconds that exceed the 99th percentile:</p><pre><code class="language-SQL">SELECT * FROM responses 
WHERE ts &gt;= now()-'30s'::interval
AND response_time &gt; (
	SELECT approx_percentile(0.99, percentile_agg)
	FROM responses_1h_agg
	WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval)
);
</code></pre>
<p>To learn more about using percentile approximation hyperfunctions, read the <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">docs</a>, try our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">tutorial using real-world NFL data</a> and see our <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">explainer blog post on why percentile approximation is more useful than averages</a>.</p><h2 id="first-and-last">first() and last()</h2><p>Another common problem is finding the first or last values for multiple time series. That often occurs in IoT scenarios, where you want to monitor devices in different locations, but each device sends back data at different times (as devices can go offline, experience connectivity issues, batch transmit data, or simply have different sampling rates).</p><p>The <code>last</code> hyperfunction allows you to get the value of one column as ordered by another. For example, <code>last(temperature, time)</code> returns the latest temperature value based on time within an aggregate group. </p><p>This way, you can more easily write queries that, for example, find the last recorded temperature at multiple locations, even when each location samples and records data at a different rate:</p><pre><code class="language-SQL">SELECT location, last(temperature, time)
  FROM conditions
  GROUP BY location;
</code></pre>
<p>Similarly, the <code>first</code> hyperfunction also allows you to get the value of one column as ordered by another. <code>first(temperature, time)</code> returns the earliest temperature value based on time within an aggregate group:</p><pre><code class="language-SQL">SELECT device_id, first(temp, time)
FROM metrics
GROUP BY device_id;
</code></pre>
<p><code>first()</code> and <code>last()</code> can also be used in more complex queries, such as finding the latest value within a specific time interval. In the example below, we find the last temperature recorded for each device in five-minute intervals throughout the past day:</p><pre><code class="language-SQL">SELECT device_id, time_bucket('5 minutes', time) AS interval,
  last(temp, time)
FROM metrics
WHERE time &gt; now() - INTERVAL '1 day'
GROUP BY device_id, interval
ORDER BY interval DESC;
</code></pre>
<p>In TimescaleDB 2.7, we’ve made <a href="https://github.com/timescale/timescaledb/pull/3943">improvements</a> to make queries with the <code>first()</code> and <code>last()</code> hyperfunctions up to twice as fast and make memory usage near constant.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> The last and first commands do not use indexes but perform a sequential scan through their groups. They are primarily used for ordered selection within a GROUP BY aggregate and not as an alternative to an ORDER BY time DESC LIMIT 1 clause to find the latest value (which uses indexes).</em></i></div></div><p>To learn more, see the docs for <a href="https://docs.timescale.com/api/latest/hyperfunctions/first/"><code>first(</code></a><code>)</code> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/last/"><code>last(</code></a><code>)</code>.</p><h2 id="more-memory-efficient-count-distinct-queries">More Memory Efficient COUNT DISTINCT Queries<br></h2><p>Calculating the exact number of distinct values in a large dataset with <a href="https://www.tigerdata.com/learn/how-to-handle-high-cardinality-data-in-postgresql" rel="noreferrer">high cardinality</a> requires lots of computational resources, which can impact the query performance and experience of your database's concurrent users. </p><p>To solve this issue, TimescaleDB provides hyperfunctions to calculate <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/">approximate COUNT DISTINCTs</a>. Approximate count distincts do not calculate the exact cardinality of a dataset, but rather estimate the number of unique values, in order to improve compute time. 
We use <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a>, a probabilistic cardinality estimator that uses significantly less memory than the equivalent <code>COUNT DISTINCT</code> query.  </p><p><a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/hyperloglog/"><code>Hyperloglog(</code></a><code>)</code> is an approximation object for <code>COUNT DISTINCT</code> queries. And the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/distinct_count/"><code>distinct_count(</code></a><code>)</code> accessor function gets the number of distinct values from a HyperLogLog object, as illustrated in the example below, which efficiently estimates the number of unique NFTs and collections in a hypothetical NFT marketplace:<br></p><pre><code class="language-SQL">SELECT
  distinct_count(hyperloglog(32768, asset_id)) AS nft_count,
  distinct_count(hyperloglog(32768, collection_id)) AS collection_count
FROM nft_sales
WHERE payment_symbol = 'ETH' AND time &gt; NOW() - INTERVAL '3 months';
</code></pre>
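<p>Like other hyperfunctions, <code>hyperloglog()</code> states can be stored in a continuous aggregate and combined later with <code>rollup()</code>. The sketch below (assuming <code>nft_sales</code> is a hypertable with a <code>time</code> column) keeps a daily sketch and then rolls the sketches up for a whole-period estimate:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW nft_sales_daily
WITH (timescaledb.continuous)
AS SELECT
    time_bucket('1 day'::interval, time) AS bucket,
    hyperloglog(32768, asset_id) AS hll
FROM nft_sales
GROUP BY bucket;

-- estimate distinct assets across all days by combining the daily sketches
SELECT distinct_count(rollup(hll)) FROM nft_sales_daily;
</code></pre>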
<p>You can also use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/stderror/"><code>std_error(</code></a><code>)</code> function to estimate the relative standard error of the HyperLogLog compared to running <code>COUNT DISTINCT</code> directly. <br>To learn more about the approximate <code>COUNT DISTINCT</code> hyperfunctions, read the<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/approx-count-distincts/"> docs</a>.</p><h2 id="enhanced-query-readability-and-maintenance-with-function-pipelines">Enhanced Query Readability and Maintenance With Function Pipelines </h2><p></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> In the spirit of </em></i><a href="https://www.timescale.com/blog/blog/move-fast-but-dont-break-things-introducing-the-experimental-schema-with-new-experimental-features-in-timescaledb-2-4/"><i><em class="italic" style="white-space: pre-wrap;">moving fast and not breaking things</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, the hyperfunctions in this section are released as experimental—please play around with them but don’t use them in production.</em></i></div></div><p>At Timescale, we’re huge fans of SQL. But as we’ve seen in many examples above, SQL can get quite unwieldy for certain kinds of analytical and time-series queries. 
Enter TimescaleDB Function Pipelines.</p><p><strong>TimescaleDB Function Pipelines</strong> radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from <a href="https://en.wikipedia.org/wiki/Functional_programming">functional programming</a> and popular tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas</a> and <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>. In short, they improve your coding productivity, making your SQL code easier for others to comprehend and maintain.</p><p>Inspired by functional programming languages, Function Pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.</p><p>And the best part: we built Function Pipelines in a fully PostgreSQL-compliant way! We did not change any SQL syntax, meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.</p><p>To understand the power of TimescaleDB Function Pipelines, consider the following PostgreSQL query.</p><p><strong>Regular PostgreSQL query:</strong><br></p><pre><code class="language-SQL">SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id;
</code></pre>
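<p>With Function Pipelines, the same volatility calculation can be expressed by chaining operations from left to right. The query below is a sketch of the pipeline form, using the experimental <code>timevector</code> type and the <code>-&gt;</code> operator:</p><pre><code class="language-SQL">SELECT device_id,
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() as volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;
</code></pre>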
<h2 id="supercharge-your-productivity-with-hyperfunctions-today">Supercharge Your Productivity With Hyperfunctions Today</h2><p><strong>Get started today: </strong>TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. <a href="https://console.cloud.timescale.com/signup">Get started with a free Timescale trial</a>—no credit card required. Or <a href="https://docs.timescale.com/install/latest/">download for free</a> with TimescaleDB self-managed.</p><p>If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/#analyze-data-using-timescaledb-continuous-aggregates-and-hyperfunctions">start our tutorial</a>, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).  </p><p><strong>Learn more: </strong>If you’d like to learn more about TimescaleDB hyperfunctions and how to use them for your use case, read our <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">How-To Guide</a> and the hyperfunctions <a href="https://docs.timescale.com/api/latest/hyperfunctions/">documentation</a>. </p><p>Can’t find the function you need?<strong> </strong>Open an issue on our <a href="https://github.com/timescale/timescaledb-toolkit/issues">GitHub project</a> or contact us on <a href="https://timescaledb.slack.com/">Slack</a> or via the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We love to work with our users to simplify SQL!<br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed Up Grafana by Auto-Switching Between Different Aggregations With Postgres]]></title>
            <description><![CDATA[Read a step-by-step guide on enabling Grafana "auto-switching" between PostgreSQL aggregations, depending on the time interval.]]></description>
            <link>https://www.tigerdata.com/blog/speed-up-grafana-autoswitching-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/speed-up-grafana-autoswitching-postgresql</guid>
            <category><![CDATA[Data Visualization]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Tue, 11 Aug 2020 14:13:02 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2020/08/Autoswtiching-header.gif">
            </media:content>
            <content:encoded><![CDATA[<p><strong>Updated March 4, 2026</strong><em> - Learn how (and why) to speed up your Grafana drill-downs using PostgreSQL to allow "auto-switching" between aggregations, depending on the time interval you select.</em></p><h2 id="the-problem-grafana-is-slow-to-load-visualizations-especially-for-non-aggregated-fine-grained-data">The problem: Grafana is slow to load visualizations, especially for non-aggregated, fine-grained data</h2><p>The&nbsp;<a href="https://grafana.com/" rel="nofollow">Grafana</a>&nbsp;UI is great for drilling down into your data. However, for large amounts of data with second, millisecond, or even nanosecond time granularity, it can be frustratingly slow and result in higher resource usage.</p><p>For example, take this graph of all New York City taxi rides during the month of January 2016:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/grafana_autoswtiching_loading.gif" class="kg-image" alt="Grafana graph loading slowly" loading="lazy" width="1600" height="597" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/grafana_autoswtiching_loading.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/grafana_autoswtiching_loading.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/grafana_autoswtiching_loading.gif 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An example of how slow drill-downs into data can be in Grafana</span></figcaption></figure><p>One common workaround: instead of querying raw data and aggregating on the fly, you query and visualize data from&nbsp;<em>aggregates</em>&nbsp;of your raw data (e.g., one-minute, one-hour, or one-day rollups).</p><p>For PostgreSQL data sources, we do this by aggregating data into views and querying those instead, and for TimescaleDB, we use continuous aggregates—think “automatically 
refreshing materialized views” (for more, see the&nbsp;<a href="https://www.tigerdata.com/docs/use-timescale/latest/continuous-aggregates/" rel="nofollow">continuous aggregates docs</a>).</p><p>However, this often leads to several Grafana panels, each querying the same data aggregated at different granularities. For example, you might capture the same metric over time but set up aggregates at various intervals, such as in minute, hourly, and daily intervals.</p><p>This then requires three separate panels, one for each aggregated interval.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png" class="kg-image" alt="Three different Grafana graphs showing rides from daily and hourly aggregates as well as raw data" loading="lazy" width="1238" height="1144" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 1238w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of three panels all showing taxi rides over January 2016 but in different time granularities (daily, hourly, and per minute, from top to bottom)</span></figcaption></figure><p>But, what if we could use&nbsp;<em>one universal panel</em>&nbsp;that could “automatically” switch between minute, hourly, daily, or any other arbitrary aggregations of our data, depending on the time period we’d like to query and analyze? 
This would speed up queries and use resources like CPU more efficiently.</p><p>Enter the PostgreSQL&nbsp;<code>UNION ALL</code>&nbsp;function.</p><h2 id="the-solution-use-postgres-union-all">The Solution: Use Postgres <code>UNION ALL</code></h2><p>When we use PostgreSQL as our Grafana data source, we can write a single query that automatically switches between different aggregated views of our data (e.g., daily, hourly, weekly views, etc.) in the same Grafana visualization (!).</p><p>🔑&nbsp;<strong>The key</strong>: we (1) use the&nbsp;<code>UNION ALL</code>&nbsp;function to write separate queries to pull data with different aggregations, and (2) then use the&nbsp;<code>WHERE</code>&nbsp;clause to switch the table (or continuous aggregate view) being queried, depending on the length of the time-interval selected (from either the time picker or by highlighting the time period in a graph).</p><p>This allows us to drill arbitrarily deep into our data and makes loading the data as efficient and fast as possible, saving time and CPU resources. 
(In Grafana, drilling into data is typically done by zooming in and out, highlighting the time period of interest in the graph as shown in the image below.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif" class="kg-image" alt="Autoswitching between daily and hourly aggregates and raw data depending on time period selected" loading="lazy" width="1280" height="720" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of auto-switching between different data aggregations depending on the time interval selected. Learn how to create this example in the tutorial below.</span></figcaption></figure><h2 id="try-it-yourself-implementation-in-grafana-sample-queries">Try It Yourself: Implementation in Grafana &amp; Sample Queries</h2><p>To help you get up and running with&nbsp;<code>UNION ALL</code>, I’ve put together a short step-by-step guide and a few sample queries (which you can modify to suit your project, app, and the metrics you care about).</p><h3 id="scenario">Scenario</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#scenario"></a></p><p>We’ll use the use case of monitoring IoT devices, specifically taxis equipped with sensors. 
For reference, we’ll use a dataset containing all New York City taxi ride activity for January 2016 from the&nbsp;<a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="nofollow">New York Taxi and Limousine Commission</a>&nbsp;(NYC TLC).</p><h3 id="prerequisites">Prerequisites</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#prerequisites"></a></p><ul><li><a href="https://www.tigerdata.com/products/" rel="nofollow">TimescaleDB instance</a>&nbsp;(Tiger Cloud or self-hosted) running PostgreSQL 11+</li><li><a href="https://grafana.com/" rel="nofollow">Grafana instance</a>&nbsp;(cloud or self-hosted)</li><li>TimescaleDB instance connected to Grafana (see&nbsp;<a href="https://www.tigerdata.com/docs/tutorials/latest/grafana/" rel="nofollow">this tutorial</a>&nbsp;for more)</li><li>Use the queries below to create two continuous aggregates with refresh policies. These will be the aggregate views we switch between in our Grafana visualization:</li></ul><p>To create daily aggregates:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_daily
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('1 day', pickup_datetime) AS day, COUNT(*) AS ride_count
    FROM rides
    GROUP BY day
WITH NO DATA;
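-- Note (our addition): with TimescaleDB 2.x, the refresh policy below only
-- covers recent time buckets, so the historical January 2016 data may never be
-- materialized by it. You can backfill the view once manually, e.g.:
-- CALL refresh_continuous_aggregate('rides_daily', '2016-01-01', '2016-02-01');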

SELECT add_continuous_aggregate_policy('rides_daily',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '1 day',
    schedule_interval =&gt; INTERVAL '1 day');</code></pre><p><em>SQL query to create daily aggregates of rides during January 2016</em></p><p>This computes a roll-up of the total number of rides taken during each day of our data’s time period (January 2016).</p><p>To create hourly aggregates:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_hourly
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('1 hour', pickup_datetime) AS hour, COUNT(*) AS ride_count
    FROM rides
    GROUP BY hour
WITH NO DATA;

SELECT add_continuous_aggregate_policy('rides_hourly',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');</code></pre><p><em>SQL query to create hourly aggregates of rides during January 2016</em></p><p>This computes a roll-up of the total number of rides taken during each hour of our data’s time period.</p><p>For more on how continuous aggregates work, see&nbsp;<a href="https://www.tigerdata.com/docs/use-timescale/latest/continuous-aggregates/" rel="nofollow">these docs</a>.</p><h3 id="example-1-auto-switch-between-daily-aggregate-hourly-aggregate-and-raw-data">Example 1: Auto-switch between daily aggregate, hourly aggregate, and raw data</h3><p>In the example below, we have a query using&nbsp;<code>UNION ALL</code>, where we select from only one table or view, depending on the length of the time interval selected in the Grafana UI (controlled by the&nbsp;<code>$__timeFrom</code>&nbsp;and&nbsp;<code>$__timeTo</code>&nbsp;macros in Grafana).</p><p>As the comments in the code below show, we use daily aggregates for intervals greater than 14 days, hourly aggregates for intervals between 3 and 14 days, and per-minute aggregates calculated on the fly from raw data for intervals less than 3 days:</p><p><strong>Switching between daily aggregation, hourly aggregation, and minute aggregations on raw data</strong></p><pre><code class="language-SQL">-- Use daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND $__timeFilter(hour)
UNION ALL
-- Use raw data (minute intervals) for intervals between 0 and 3 days
SELECT * FROM
    (SELECT time_bucket('1m',pickup_datetime) AS time, count(*), 'minute' AS metric
    FROM rides
    WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND $__timeFilter(pickup_datetime)
    GROUP BY 1) minute
ORDER BY 1;</code></pre><p><em>Query to switch between daily aggregation, hourly aggregation, and per-minute aggregations created on the fly using raw data</em></p><p><strong>This produces the following behavior in our Grafana panels:</strong></p><p>Querying daily aggregates for intervals greater than 14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png" class="kg-image" alt="Graph showing rides taking place in daily intervals for an interval greater than 14 days" loading="lazy" width="1600" height="752" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by the daily aggregate view for intervals greater than 14 days</span></figcaption></figure><p>Querying hourly aggregates for intervals between 3-14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png" class="kg-image" alt="Graph showing rides taking place in hourly intervals for an interval between 3 and 14 days" loading="lazy" width="1600" height="749" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by the 
hourly aggregate view for intervals between 3 and 14 days</span></figcaption></figure><p>Querying raw data for intervals less than 3 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png" class="kg-image" alt="Graph showing rides taking place in minute intervals for an interval less than 3 days" loading="lazy" width="1600" height="751" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by rolling up raw data into 1-minute intervals on the fly for intervals of less than 3 days</span></figcaption></figure><p>This allows you to automatically switch between different aggregations of data, depending on the length of the time interval selected. 
Notice how the granularity of the data gets richer as we drill down from looking at data over the month of January to looking at data in a single day:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif" class="kg-image" alt="Graph changing from daily interval to hourly interval to minute interval as we zoom in" loading="lazy" width="1280" height="720" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Demo of automatically switching between daily, hourly, and minute aggregations of data, depending on time interval selected</span></figcaption></figure><h3 id="example-2-auto-switch-between-daily-hourly-and-10-minute-aggregates">Example 2: Auto-switch between daily, hourly, and 10-minute aggregates</h3><p>Querying only from continuous aggregates allows us to speed up our dashboards even further. You might not want to directly query the hypertable that houses your raw data, as the queries may be slower due to things like new data being inserted into the hypertable.</p><p>The following example shows a query for switching between aggregations of different granularity without using the raw data hypertable at all (unlike Example 1, which does on-the-fly rollups of raw data).</p><p>First, let’s create 10-minute rollups of the raw data:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_10mins
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('10 minutes', pickup_datetime) AS bucket, COUNT(*) AS ride_count
    FROM rides
    GROUP BY bucket
WITH NO DATA;

SELECT add_continuous_aggregate_policy('rides_10mins',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '10 minutes',
    schedule_interval =&gt; INTERVAL '10 minutes');</code></pre><p><em>Query to create 10-minute rollups of data in a continuous aggregate</em></p><p><strong>Switching between daily aggregation, hourly aggregation, and minute aggregations (no raw data involved)</strong></p><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND  $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND  $__timeFilter(hour)
UNION ALL
-- Use 10-minute aggregate for intervals between 0 and 3 days
SELECT bucket, ride_count, '10min' AS metric
FROM rides_10mins
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND  $__timeFilter(bucket)
ORDER BY 1; </code></pre><p><em>Query to switch between daily aggregation, hourly aggregation, and 10-minute aggregations, all using continuous aggregates</em></p><p>In this post, we saw how to use&nbsp;<code>UNION ALL</code>&nbsp;to automatically switch which aggregate view we’re querying based on the time interval selected, so that we can do more efficient drill-downs and make Grafana faster.</p><p>You can find more information about the&nbsp;<code>UNION ALL</code>&nbsp;operator and how it works in this&nbsp;<a href="https://www.postgresqltutorial.com/postgresql-union/" rel="nofollow">PostgreSQL tutorial</a> (from the aptly named PostgreSQLtutorial.com) and the&nbsp;<a href="https://www.postgresql.org/docs/current/queries-union.html" rel="nofollow">official PostgreSQL documentation</a>.</p><p>That’s it! You can modify this code to change the aggregates you query, the time intervals, and the metrics you want to visualize to suit your needs and projects.</p><p>Happy auto-switching!</p><h2 id="next-steps">Next Steps</h2><p>In this tutorial, we learned how to use PostgreSQL&nbsp;<code>UNION ALL</code>&nbsp;to solve a common Grafana issue: slow-loading dashboards when we want to query fine-grained raw data (like millisecond performance metrics).</p><p>The result: you create graphs that enable you to switch between different aggregations of your data automatically. This allows you to quickly drill down into your metrics, saving time&nbsp;<em>and</em>&nbsp;CPU resources!</p><p>For more resources&nbsp;<a href="https://www.tigerdata.com/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/" rel="nofollow">to speed up Grafana performance, learn how you can fix slow dashboards using downsampling</a>.</p><h3 id="learn-more">Learn more</h3><p>Want more Grafana tips? 
Explore our&nbsp;<a href="https://www.tigerdata.com/docs/tutorials/latest/grafana/" rel="nofollow">Grafana tutorials</a>.</p><p>Need a database to power your dashboarding and data analysis?&nbsp;<strong>Get started with a free</strong>&nbsp;<a href="https://console.cloud.timescale.com/signup" rel="nofollow"><strong>Tiger Cloud account</strong></a><strong>.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Analyze Cryptocurrency Market Data using TimescaleDB, PostgreSQL and Tableau: a Step-by-Step Tutorial]]></title>
            <description><![CDATA[This tutorial is a step-by-step guide on how to analyze a time-series cryptocurrency dataset using Postgres, TimescaleDB and Tableau]]></description>
            <link>https://www.tigerdata.com/blog/tutorials-how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/tutorials-how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial</guid>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[Tableau]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Thu, 19 Sep 2019 19:55:13 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2019/09/Hero-1-1.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>This tutorial is a step-by-step guide on how to analyze a time-series cryptocurrency dataset using Postgres, TimescaleDB and Tableau. The instructions in this tutorial were used to create this <a href="https://timescale.ghost.io/blog/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/">analysis of 4100+ cryptocurrencies</a>.</p><h2 id="overview-of-steps">Overview of steps</h2><p><strong>Step 0: Install TimescaleDB via Timescale Cloud: </strong>We’ll create a Timescale Cloud account and spin up a TimescaleDB instance.</p><p><strong>Step 1: Design the database schema: </strong>We’ll guide you through how to design a schema for cryptocurrency data to use with TimescaleDB.</p><p><strong>Step 2: Create a dataset to analyze: </strong>We’ll use the CryptoCompareAPI and Python to create a CSV file containing the data to analyze.</p><p><strong>Step 3: Load dataset into TimescaleDB: </strong>We’ll insert the data from the CSV file into TimescaleDB using pgAdmin.</p><p><strong>Step 4: Query the data in TimescaleDB: </strong>We’ll connect our data in TimescaleDB to Tableau and perform queries on the dataset.</p><p><strong>Step 5: Visualize the results: </strong>We’ll use Tableau in order to visualize the results from our queries.</p><p>You can download all files and code used in this analysis in this <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial">Github repo</a>. Note that the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset provided</a> tracks OHLCV price data on 4198 different cryptocurrencies (courtesy of <a href="https://www.cryptocompare.com/">CryptoCompare</a>) as of 9/16/2019. 
If you follow the steps, your dataset will be current up to the date on which you perform the analysis.</p><h2 id="step-0-install-timescaledb-via-timescale-cloud">Step 0: Install TimescaleDB via Timescale Cloud</h2><p>Go to <a href="http://www.timescale.com/cloud">www.timescale.com/cloud</a> and sign up for a free trial, where you will receive $300 in credits, to use a cloud-hosted and managed version of TimescaleDB. This is the easiest way to install the DB. If you prefer, you can <a href="https://docs.timescale.com/latest/getting-started/installation">install an instance yourself on your machine by following these instructions</a>. However, the instructions in this post will assume you’re using Timescale Cloud.</p><p>After you’ve created an account, log in and create a database instance (you can name it something like “crypto_database”). Then select your preferred configuration (dev-only should be enough for this analysis). After successfully creating the database instance, you should see it active.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/1.-New-Db-instance-active.png" class="kg-image" alt="" loading="lazy"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Fig 1: Timescale Cloud page showing an active TSDB instance</em></i></figcaption></figure><p>Once the instance is active, navigate to “Databases” and create a new database. I’ve called mine ‘crypto-test’. 
You should see it in the list of Databases after it’s been created.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/3.Create-New-Database.png" class="kg-image" alt="" loading="lazy"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Fig 2: Successful creation of a database in a Timescale Cloud instance</em></i></figcaption></figure><h2 id="step-1-design-the-database-schema">Step 1: Design the database schema</h2><p>Now that our database is up and running, we need some data to insert into it. Before we get data for analysis, we first need to define what kind of data we want to perform queries on. (To skip ahead, see the code in <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql">schema.sql</a>.)</p><p>In our analysis, we have two main goals.</p><ol><li>We want to explore the price of Bitcoin and Ethereum, expressed in different fiat currencies, over time.</li><li>We want to explore the price of different cryptocurrencies, expressed in Bitcoin, over time.</li></ol><p>Examples of questions we might want to ask are:</p><ul><li>How has Bitcoin’s price in USD varied over time?</li><li>How has Ethereum’s price in ZAR varied over time?</li><li>How has Bitcoin’s trading volume in KRW increased or decreased over time?</li><li>Which crypto has the highest trading volume in the last two weeks?</li><li>On which day was Bitcoin most profitable?</li><li>Which are the most profitable new coins from the past 3 months?</li></ul><p>Understanding the questions we want to ask of the data leads us to define a schema for our database, so that we can acquire the necessary data to populate it.<br>Our requirements lead us to four tables: three TimescaleDB <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">hypertables</a>, <em>btc_prices, crypto_prices </em>and<em> eth_prices</em>, and one relational table, <em>currency_info</em>.</p><p>The 
btc_prices hypertable contains data about Bitcoin prices in 17 different fiat currencies since 2010:</p><table>
<thead>
<tr>
<th style="text-align:center">btc_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_btc</td>
<td>The volume exchanged in the cryptocurrency value that day, in BTC.</td>
</tr>
<tr>
<td style="text-align:center">volume_currency</td>
<td>The volume exchanged in its converted value for that day, quoted in the corresponding fiat currency.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the fiat currency used for non-btc prices/volumes.</td>
</tr>
</tbody>
</table>
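<p>As a quick illustration of the kind of question this schema answers (a sample query of our own, not from the tutorial’s files), here is how you might chart Bitcoin’s closing price in USD over time:</p><pre><code class="language-SQL">-- Daily closing price of Bitcoin in USD
SELECT time, closing_price
FROM btc_prices
WHERE currency_code = 'USD'
ORDER BY time;</code></pre>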
<p><br>Similar to btc_prices, the eth_prices hypertable contains data about Ethereum prices in 17 different fiat currencies since 2015:</p><table>
<thead>
<tr>
<th style="text-align:center">eth_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_eth</td>
<td>The volume exchanged in the cryptocurrency value that day, in ETH.</td>
</tr>
<tr>
<td style="text-align:center">volume_currency</td>
<td>The volume exchanged in its converted value for that day, quoted in the corresponding fiat currency.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the fiat currency used for non-ETH prices/volumes.</td>
</tr>
</tbody>
</table>
<p>The crypto_prices hypertable contains data about 4198 cryptocurrencies, including bitcoin and the corresponding crypto/BTC exchange rate, since 2012 or so.</p><table>
<thead>
<tr>
<th style="text-align:center">crypto_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_crypto</td>
<td>The volume exchanged in the cryptocurrency value that day, in that crypto.</td>
</tr>
<tr>
<td style="text-align:center">volume_btc</td>
<td>The volume exchanged in its converted value for that day, quoted in BTC.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the cryptocurrency’s symbol (e.g., ETH, LTC).</td>
</tr>
</tbody>
</table>
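<p>This is the schema behind questions like “Which crypto has the highest trading volume in the last two weeks?” As a sketch (our own example query, not from the tutorial; column names follow the <em>crypto_prices</em> table in schema.sql below):</p><pre><code class="language-SQL">-- Top 10 cryptos by trading volume over the last 14 days, quoted in BTC
SELECT currency_code, SUM(volume_btc) AS total_volume_btc
FROM crypto_prices
WHERE time &gt; now() - INTERVAL '14 days'
GROUP BY currency_code
ORDER BY total_volume_btc DESC
LIMIT 10;</code></pre>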
<p>Lastly, we have the currency_info table (a regular relational table, not a hypertable), which maps each currency’s code to its name.</p><table>
<thead>
<tr>
<th style="text-align:center">currency_info table schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>2-7 character abbreviation for currency. Used in other hypertables</td>
</tr>
<tr>
<td style="text-align:center">currency</td>
<td>English name of currency</td>
</tr>
</tbody>
</table>
<p>Once we’ve established the schema for the tables in our database, we can formulate create_table SQL statements to actually create the tables we need:</p><p>Code from <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql"><em>schema.sql</em></a>:</p><pre><code class="language-SQL">--Schema for cryptocurrency analysis
DROP TABLE IF EXISTS "currency_info";
CREATE TABLE "currency_info"(
   currency_code   VARCHAR (10),
   currency        TEXT
);
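-- (Optional, our suggestion rather than the tutorial's) a primary key on
-- currency_code prevents duplicate rows and speeds up joins against it:
-- ALTER TABLE "currency_info" ADD PRIMARY KEY (currency_code);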

--Schema for btc_prices table
DROP TABLE IF EXISTS "btc_prices";
CREATE TABLE "btc_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_btc      DOUBLE PRECISION,
   volume_currency DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Schema for crypto_prices table
DROP TABLE IF EXISTS "crypto_prices";
CREATE TABLE "crypto_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_crypto   DOUBLE PRECISION,
   volume_btc      DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Schema for eth_prices table
DROP TABLE IF EXISTS "eth_prices";
CREATE TABLE "eth_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_eth      DOUBLE PRECISION,
   volume_currency DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Timescale specific statements to create hypertables for better performance
SELECT create_hypertable('btc_prices', 'time', 'currency_code', 2);
SELECT create_hypertable('eth_prices', 'time', 'currency_code', 2);
SELECT create_hypertable('crypto_prices', 'time', 'currency_code', 2);</code></pre><p>Notice that we include three create_hypertable statements, which are specific to TimescaleDB. For more on hypertables, see the <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">Timescale docs</a> and this <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">blog post</a>.</p><h2 id="step-2-create-a-dataset-to-analyze">Step 2: Create a dataset to analyze</h2><p>Now that we’ve defined the data we want, it’s time to construct a dataset containing that data. To do this, we’ll write a small Python script (to skip ahead, see <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_data_extraction.py">crypto_data_extraction.py</a>) for extracting data from <a href="https://www.cryptocompare.com/">cryptocompare.com</a> into four CSV files (coin_names.csv, crypto_prices.csv, btc_prices.csv and eth_prices.csv).<br></p><p>In order to get data from CryptoCompare, you’ll need to obtain an <a href="https://min-api.cryptocompare.com/pricing">API key</a>. For this analysis, the free key should be plenty.</p><p>The script consists of 5 parts:<br><strong>(1) Setup: </strong>First, we need to import some libraries to help us parse the data. Notably, we will use the Python ‘requests’ library, which makes it easy to deal with JSON data from a web API endpoint.</p><pre><code class="language-python">import requests
import json
import csv
from datetime import datetime</code></pre><p>Moreover you’ll need your <a href="https://min-api.cryptocompare.com/pricing">CryptoCompare API key</a> as a variable. We’ve just included it as a normal variable in the code below (this is not recommended for production code) but you can store it as an environment variable or follow whatever production security practices for API key management you usually do.<br></p><pre><code class="language-python">apikey = 'YOUR_CRYPTO_COMPARE_API_KEY'
#attach to end of URLstring
url_api_part = '&amp;api_key=' + apikey</code></pre><p><br><strong>(2) Get a list of all coin names to populate table <em>currency_info: </em></strong>First we use the Python requests library’s <em>get function</em> to get a JSON object containing the list of coins names and symbols on CryptoCompare. Then we convert the data to a dictionary form and write information about the coin to the csv file ‘coin_names.csv’.</p><pre><code class="language-python">#####################################################################
#2. Populate list of all coin names
#####################################################################
#URL to get a list of coins from cryptocompare API
URLcoinslist = 'https://min-api.cryptocompare.com/data/all/coinlist'

#Get list of cryptos with their symbols
res1 = requests.get(URLcoinslist)
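#(Our addition, not in the original script) fail fast on HTTP-level errors
res1.raise_for_status()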
res1_json = res1.json()
data1 = res1_json['Data']
symbol_array = []
cryptoDict = dict(data1)

#write to CSV
with open('coin_names.csv', mode = 'w') as test_file:
   test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
   for coin in cryptoDict.values():
       name = coin['Name']
       symbol = coin['Symbol']
       symbol_array.append(symbol)
       coin_name = coin['CoinName']
       full_name = coin['FullName']
       entry = [symbol, coin_name]
       test_file_writer.writerow(entry)
print('Done getting crypto names and symbols. See coin_names.csv for result')</code></pre><p><strong>(3) Get historical BTC prices for 4198 other cryptos to populate <em>crypto_prices: </em></strong>Once we have the list of all the coin names, we can iterate through them and pull their historical prices in BTC since they listed on CryptoCompare. We write that information to the CSV file “crypto_prices.csv”.</p><pre><code class="language-python">#####################################################################
#3. Populate historical price for each crypto in BTC
#####################################################################
#Note: this part might take a while to run since we're populating data for 4k+ coins
#counter variable for progress made
progress = 0
num_cryptos = str(len(symbol_array))
for symbol in symbol_array:
   # get data for that currency
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym='+ symbol +'&amp;tsym=BTC&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('crypto_prices.csv', mode = 'a') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, symbol]
           test_file_writer.writerow(entry)
   progress = progress + 1
   print('Processed ' + str(symbol))
   print(str(progress) + ' currencies out of ' +  num_cryptos + ' written to csv')
print('Done getting price data for all coins. See crypto_prices.csv for result')</code></pre><p>Notice how the fields defined in Step 1 influence the choice of what data we write to the CSV file!</p><p><strong>(4) Get historical Bitcoin prices in different fiat currencies to populate <em>btc_prices: </em></strong>We then create a list of different fiat currencies in which we want to express Bitcoin’s price. Unfortunately CryptoCompare doesn’t have a comprehensive list so we’ve hard coded (gasp!) a list of 17 popular fiat currencies.</p><p>We then iterate over the list of fiat currencies and pull the historical Bitcoin price in that currency and write it to the CSV file “btc_prices.csv”.</p><pre><code class="language-python">#####################################################################
#4. Populate BTC prices in different fiat currencies
#####################################################################
# List of fiat currencies we want to query
# You can expand this list, but CryptoCompare does not have
# a comprehensive fiat list on their site
fiatList = ['AUD', 'CAD', 'CNY', 'EUR', 'GBP', 'GOLD', 'HKD',
'ILS', 'INR', 'JPY', 'KRW', 'PLN', 'RUB', 'SGD', 'UAH', 'USD', 'ZAR']
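
# Note (our addition): for unsupported pairs the API returns an error payload
# ("Response": "Error") instead of a "Data" key; a defensive check inside the
# loop below could skip those, e.g.:
#   if res_json.get('Response') == 'Error':
#       continue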

#counter variable for progress made
progress2 = 0
for fiat in fiatList:
   # get data for bitcoin price in that fiat
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym='+fiat+'&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('btc_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, fiat]
           test_file_writer.writerow(entry)
   progress2 = progress2 + 1
   print('Processed ' + str(fiat))
   print(str(progress2) + ' currencies out of ' + str(len(fiatList)) + ' written')
print('Done getting price data for btc. See btc_prices.csv for result')</code></pre><p><br><strong>(5) Get historical Ethereum prices in different fiat currencies to populate <em>eth_prices: </em></strong>Lastly, we do the same for Ethereum and the list of fiat currencies and write the results in the CSV file “eth_prices.csv”.</p><pre><code class="language-python">#####################################################################
#5. Populate ETH prices in different fiat currencies
#####################################################################
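# Aside (ours, not in the original tutorial): sections 3, 4, and 5 all
# repeat the same "format one day, append it to a CSV" inner loop. That
# shared logic could be factored into helpers like the following
# (hypothetical names; note that datetime.utcfromtimestamp, used above,
# is deprecated in newer Python, so this uses a timezone-aware equivalent):
import csv
from datetime import datetime, timezone

def day_to_row(day, label):
    """Map one CryptoCompare 'day' dict to the CSV row shape used above."""
    ts = datetime.fromtimestamp(day['time'], tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
    return [ts, day['open'], day['high'], day['low'], day['close'],
            day['volumefrom'], day['volumeto'], label]

def append_history(filename, data, label):
    """Append all daily rows for one symbol or fiat currency to a CSV file."""
    with open(filename, mode='a', newline='') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for day in data:
            writer.writerow(day_to_row(day, label))

# With these helpers, the loop below reduces to fetching the data and
# calling append_history('eth_prices.csv', data, fiat).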
#counter variable for progress made
progress3 = 0
for fiat in fiatList:
   # get data for ethereum price in that fiat
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym=ETH&amp;tsym='+fiat+'&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('eth_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, fiat]
           test_file_writer.writerow(entry)
   progress3 = progress3 + 1
   print('Processed ' + str(fiat))
   print(str(progress3) + ' currencies out of ' + str(len(fiatList)) + ' written')
print('Done getting price data for eth. See eth_prices.csv for result')</code></pre><p>If you’d rather not pull a fresh data set, you’re welcome to use the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset we already created</a>, but note that it only has data until 9/16/2019.</p><h2 id="step-3-load-dataset-into-timescaledb-using-timescale-cloud-and-pgadmin">Step 3: Load dataset into TimescaleDB, using Timescale Cloud and pgAdmin</h2><p>After following Step 2, you should have 4 CSV files (if not download them <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">here</a>).</p><p>The next step is to load this data into TimescaleDB in order to query it and perform our analysis. In Step 0, we created a TimescaleDB instance in Timescale Cloud, so all that’s left is to use pgAdmin to create the tables from Step 1 and transfer data from each csv file to the relevant table.</p><p><strong>3.1 Connect to your TimescaleDB instance</strong></p><p>Download and install <a href="https://www.pgadmin.org/">pgAdmin</a>, or your favorite postgres admin tool (utilities like <a href="http://postgresguide.com/utilities/psql.html">psql</a> also work). 
Once installed, login to your database using the credentials on the ‘Overview page’ of your Timescale Cloud instance as shown in Fig 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/2.-TSDB-Cloud-overview-creds-page-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 3: Timescale Cloud ‘Overview’ page to find credentials to login</span></figcaption></figure><p>Once logged in, you should see something like Fig 4 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/4.-Connected-to-TSDB.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 4: Successful login to Timescale Cloud database in pgAdmin!</span></figcaption></figure><p></p><p><strong>3.2 Use the SQL code from Step 1 to create tables</strong></p><p>Now all our hard work in Step 1 comes in handy! Use the Query Tool in pgAdmin to create the tables we defined in Step 1. One way to find the tool is to navigate to your_project_name -&gt; Databases-&gt; your_db_name, then right click and select Query Tool, as Fig 5 shows below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/QUERY-TOOL.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 5: Locating the Query Tool in pgAdmin</span></figcaption></figure><p>Next, we copy and paste the code from Step 1 (found in <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql">schema.sql</a>) into the Query Editor and run the query. 
This creates the 4 tables with the necessary fields, as well as the 3 hypertables for btc_prices, eth_prices and crypto_prices.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Example-query.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 6: Run the code from schema.sql in the Query Tool in pgAdmin to create the necessary tables</span></figcaption></figure><p>To check that our query was successful, look at the output in the Data Output pane or navigate down to your_project_name -&gt; databases -&gt; your_db_name -&gt; schemas -&gt; tables and you should see your 4 tables.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Table-created-.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 7: Successful creation of tables for analysis in pgAdmin</span></figcaption></figure><p><strong>3.3 Import the data into the database tables</strong></p><p>Now that we’ve created the tables with our desired schema, all that’s left is to insert the data from the CSV files we’ve created into the tables.</p><p>We will do this using the Import tool in pgAdmin, but you could use psql or, for better performance on large datasets, the <a href="https://github.com/timescale/timescaledb-parallel-copy">TimescaleDB parallel-copy tool</a>. However, for files of the size we’ve created, the pgAdmin import tool works just fine.</p><p>To import a CSV file to a table, navigate down to your_project_name -&gt; databases -&gt; your_db_name -&gt; schemas -&gt; tables, right-click on the table you’d like to insert data into, and then select ‘Import/Export’ from the menu, as shown in Fig 8 below. We’ll first insert data into the btc_prices table. 
</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Import-1.png" class="kg-image" alt="" loading="lazy"></figure><p>Once you’ve selected Import/Export, select Import, then select the csv file from your directory (in this case, btc_prices.csv), and then select comma (,) as the delimiter, as shown in Fig 9 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Import-2.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 9: Importing btc_prices.csv into the btc_prices table using pgAdmin</span></figcaption></figure><p>To check if this worked, right click on btc_prices table, select ‘view/edit data’ -&gt; ‘all rows’, as shown in Fig 10. Fig 11 shows that our data has successfully been inserted into the btc_prices table.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Check-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 10: Checking that data has been inserted into the table in pgAdmin</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Check-2.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 11: Verifying that btc price data is in fact in btc_prices table using pgAdmin</span></figcaption></figure><p>Repeat these steps above for crypto_prices.csv and the crypto_prices table, eth_prices.csv and the eth_prices table and coin_names.csv and the currency_info table, respectively.</p><h2 id="step-4-query-the-data"><br>Step 4: Query the Data</h2><p>With the data needed for our analysis now sitting snugly in our database tables, we can now perform queries on our dataset in order to answer some of the questions posed in Step 
1.</p><p>The code below (from <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_queries.sql">crypto_queries.sql</a>) contains a sample list of questions and corresponding queries to answer those questions. Of course, you can add in your own questions and create Postgres queries to answer them in addition to, or in place of, the questions and queries provided.</p><p>From <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_queries.sql">crypto_queries.sql</a>:</p><pre><code class="language-SQL">--Query 1
-- How did Bitcoin price in USD vary over time?
-- BTC 7 day prices
SELECT time_bucket('7 days', time) as period,
      last(closing_price, time) AS last_closing_price
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period
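-- (Aside, not in the original file: time_bucket() groups the rows into
-- fixed 7-day windows, and last(closing_price, time) returns the
-- closing_price of the latest row within each window, i.e., the weekly
-- closing price.)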

--Query 2
-- How did BTC daily returns vary over time?
-- Which days had the worst and best returns?
-- BTC daily return
SELECT time,
      closing_price / lead(closing_price) over prices AS daily_factor
FROM (
  SELECT time,
         closing_price
  FROM btc_prices
  WHERE currency_code = 'USD'
  GROUP BY 1,2
) sub window prices AS (ORDER BY time DESC)
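-- (Aside, not in the original file: because the window above orders by
-- time DESC, lead() fetches the previous day's close, so daily_factor is
-- today's close divided by yesterday's. An equivalent form uses lag()
-- over ascending time:
--   SELECT time,
--          closing_price / lag(closing_price) OVER (ORDER BY time) AS daily_factor
--   FROM btc_prices
--   WHERE currency_code = 'USD';
-- )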

--Query 3
-- How did the trading volume of Bitcoin vary over time in different fiat currencies?
-- BTC volume in different fiat in 7 day intervals
SELECT time_bucket('7 days', time) as period,
      currency_code,
      sum(volume_btc)
FROM btc_prices
GROUP BY currency_code, period
ORDER BY period

-- Q4
-- How did Ethereum (ETH) price in BTC vary over time?
-- ETH prices in BTC in 7 day intervals
SELECT
   time_bucket('7 days', time) AS time_period,
   last(closing_price, time) AS closing_price_btc
FROM crypto_prices
WHERE currency_code='ETH'
GROUP BY time_period
ORDER BY time_period

--Q5
-- How did ETH prices, in different fiat currencies, vary over time?
-- (using the BTC/Fiat exchange rate at the time)
-- ETH prices in fiat
SELECT time_bucket('7 days', c.time) AS time_period,
      last(c.closing_price, c.time) AS last_closing_price_in_btc,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'USD') AS last_closing_price_in_usd,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'EUR') AS last_closing_price_in_eur,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'CNY') AS last_closing_price_in_cny,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'JPY') AS last_closing_price_in_jpy,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'KRW') AS last_closing_price_in_krw
FROM crypto_prices c
JOIN btc_prices b
   ON time_bucket('1 day', c.time) = time_bucket('1 day', b.time)
WHERE c.currency_code = 'ETH'
GROUP BY time_period
ORDER BY time_period
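-- (Aside, not in the original file: the FILTER clauses above restrict each
-- last() aggregate to the rows of one fiat currency, so a single pass over
-- the joined data yields ETH closing prices in five fiat currencies at once.)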

--Q6
--Crypto by date of first data
SELECT ci.currency_code, min(c.time)
FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
AND c.closing_price &gt; 0
GROUP BY ci.currency_code
ORDER BY min(c.time) DESC

--Q7
-- Number of new cryptocurrencies by day
-- Which days had the most new cryptocurrencies added?
SELECT day, COUNT(code)
FROM (
  SELECT min(c.time) AS day, ci.currency_code AS code
  FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
  AND c.closing_price &gt; 0
  GROUP BY ci.currency_code
  ORDER BY min(c.time)
)a
GROUP BY day
ORDER BY day DESC


--Q8
-- Which cryptocurrencies had the most transaction volume in the past 14 days?
--Crypto transaction volume during a certain time period
SELECT 'BTC' as currency_code,
       sum(b.volume_currency) as total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
GROUP BY b.currency_code
UNION
SELECT c.currency_code as currency_code,
       sum(c.volume_btc) * avg(b.closing_price) as total_volume_in_usd
FROM crypto_prices c JOIN btc_prices b ON date(c.time) = date(b.time)
WHERE c.volume_btc &gt; 0
AND b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
AND now() - date(c.time) &lt; INTERVAL '14 day'
GROUP BY c.currency_code
ORDER BY total_volume_in_usd DESC

--Q9
--Which cryptocurrencies had the top daily return?
--Top crypto by daily return
WITH
   prev_day_closing AS (
SELECT
   currency_code,
   time,
   closing_price,
   LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY TIME DESC) AS prev_day_closing_price
FROM
    crypto_prices  
)
,    daily_factor AS (
SELECT
   currency_code,
   time,
   CASE WHEN prev_day_closing_price = 0 THEN 0 ELSE closing_price/prev_day_closing_price END AS daily_factor
FROM
   prev_day_closing
)
SELECT
   time,
   LAST(currency_code, daily_factor) as currency_code,
   MAX(daily_factor) as max_daily_factor
FROM
   daily_factor
GROUP BY
   TIME</code></pre><p>For this step and in Step 5, we’ll use <a href="https://www.tableau.com/">Tableau</a> to run the above queries on the dataset and visualize the output. You’re welcome to use other data visualization tools like <a href="https://grafana.com/blog/2018/10/15/make-time-series-exploration-easier-with-the-postgresql/timescaledb-query-editor/">Grafana</a>, but ensure that the tool you’ve selected has a Postgres connector.</p><p>The following steps show how to query the data using <a href="https://www.tableau.com/">Tableau</a>:<br><strong>4.1 Create a new workbook: </strong>This will be used to house all the graphs for the analysis.</p><p><strong>4.2 Connect TimescaleDB to Tableau: </strong>Create a connection between Tableau and TimescaleDB running in your Timescale Cloud instance.</p><p>Connecting your TimescaleDB instance in the cloud to Tableau takes just a few clicks, thanks to Tableau’s built-in Postgres connector. To connect to your database, add a new connection and, under the ‘to a server’ section, select PostgreSQL as the connection type. Then enter your database credentials (found in the Timescale Cloud ‘Overview’ tab) like we did in Step 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/New-Data-Source-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 12: Adding a connection from Timescale Cloud in Tableau</span></figcaption></figure><p><strong>4.3 Create a new data source: </strong>Create a new data source and rename it to something unique, as by default it takes the name of your database.</p><p>You’ll need to do this for each query in the analysis, since each data source only supports one piece of custom SQL. 
A way to create many data sources with the same database is to create one in the way described above, duplicate it and then change the custom SQL used each time, since the database you’re connecting to remains the same.</p><p><strong>4.4 Query the data: </strong>Here we’ll use Tableau and the built in SQL editor. To run a query, add custom SQL to your data source by dragging and dropping the “New Custom SQL” button to the place that says ‘Add tables here’.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/New-Custom-SQL-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 13: Adding Custom SQL to a data source in Tableau</span></figcaption></figure><p>Once you’ve done that, paste the query you want in the query editor. In the example in Fig 14 below, we’ll use Query 1 from crypto_queries.sql, for historical BTC prices in USD.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Tableau-Query-Tool.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 14: Query for Historical Bitcoin prices in USD in the Tableau query editor</span></figcaption></figure><p><br>Once you’ve entered the query, press OK and then Update Now and you’ll see the results in a table, as illustrated by Fig 15 below.<br></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Query-Success-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 15: Results from a successful execution of Query 1 in Tableau</span></figcaption></figure><h2 id="step-5-data-visualization-in-tableau">Step 5: Data Visualization in Tableau</h2><p>Results in a table are only so useful, graphs are much better! 
So in our final step, let’s take our output from Step 4 and turn it into an interactive graph in Tableau.</p><p>To do this, create a new worksheet (or dashboard) and then select your desired data source (in our case ‘btc 7 days’), shown in Fig 16 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/New-worksheet-with-data-source-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 16: A new Tableau worksheet linked to the data source ‘btc 7 days’</span></figcaption></figure><p>Next, you locate the Dimensions and Measures pane on the left. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/dimensions-and-measures.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 17: The dimensions and measures pane in Tableau</span></figcaption></figure><p>Then, drag the period (time) dimension to ‘Columns’ part of sheet and then the ‘last closing price’ measure to the rows part of the worksheet. You should see something like the graph shown in Fig 18 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/graph-before-time-adj.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 18: Initial graph after dragging dimensions and measures in Tableau</span></figcaption></figure><p>Now this graph doesn’t quite have the level of fidelity we’re looking for because the data points are being grouped by year. To fix this, click on the drop down arrow on period and select ‘exact date’. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/how-to-adjust.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 19: Finding the exact date setting on a dimension in Tableau</span></figcaption></figure><p>This undoes the grouping by year and matches each price data point to the exact date that price occurred on. Your graph should now look like Fig 20 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/graph-after-time-adjustment.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 20: Graph after selecting correct setting for ‘period’ in Tableau</span></figcaption></figure><p>From there you can edit axis labels and colors and even add filters to zoom in on a specific time period. Here’s our final result, with labels added, in Fig 21 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/graph-after-final-editing.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 21: Final graph showing Bitcoin prices in USD from 2010-2019</span></figcaption></figure><p>We encourage you to explore different visualization formats for the results you obtain from the queries provided. For inspiration, check out the <a href="https://timescale.ghost.io/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb">different visualizations we used in our analysis</a>.</p><h2 id="conclusion">Conclusion</h2><p>This tutorial showed you, step by step, one method of defining, creating, loading and analyzing a cryptocurrency market dataset. 
Here’s a reminder of what we covered:<br></p><ol><li>We created a Timescale Cloud account and spun up a TimescaleDB instance.</li><li>We learned how to design a schema for cryptocurrency data for TimescaleDB and PostgreSQL.</li><li>We used the CryptoCompareAPI and Python to create a CSV file containing the data to analyze</li><li>We inserted the data from the CSV files created into TimescaleDB using pgAdmin and Timescale Cloud.</li><li>We connected our data in TimescaleDB to Tableau and performed queries on the dataset</li><li>We used Tableau to create graphs to visualize the results from our queries</li></ol><p>Thank you for reading this far and if you followed all the steps, congratulations on successfully completing this tutorial! We hope you’ve enjoyed following along and that you’ve found this tutorial helpful. </p><p>Perhaps you’d also enjoy our <a href="https://timescale.ghost.io/blog/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/">analysis of over 4100 cryptocurrencies produced by following this tutorial</a>.</p><p>For follow up questions or comments, reach out to us on Twitter (<a href="https://twitter.com/timescaledb">@TimescaleDB</a> or <a href="https://twitter.com/avthars">@avthars</a>), our community <a href="http://slack.timescale.com/">Slack channel</a>, or reach out to me directly via email (avthar at timescale dot com).</p><p>Finally, if you’re interested in learning more about us, check out the <a href="https://www.timescale.com/">Timescale website</a>, <a href="https://www.timescale.com/cloud-signup">Timescale Cloud</a>, and see our <a href="https://github.com/timescale/timescaledb">GitHub</a> and let us know how we can help!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyzing Bitcoin, Ethereum, and 4,100+ Other Cryptocurrencies Using PostgreSQL and TimescaleDB]]></title>
            <description><![CDATA[After a Crypto Winter in 2018, cryptocurrencies today are resurging. How can data help us better understand the Crypto Revival?]]></description>
            <link>https://www.tigerdata.com/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Thu, 19 Sep 2019 19:55:07 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2019/09/andre-francois-mckenzie-JrjhtBJ-pGU-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<p><strong><em>After a Crypto Winter in 2018, cryptocurrencies today are resurging. How can data help us better understand the Crypto Revival?</em></strong></p><p>When Satoshi Nakamoto first published the Bitcoin whitepaper in 2008, they probably didn’t foresee the world of <a href="https://www.investopedia.com/terms/h/hodl.asp">hodlers</a>, <a href="https://decryptionary.com/dictionary/lambo/">lambos</a>, <a href="https://www.definitions.net/definition/Buidl">buidlers</a>, <a href="https://www.vice.com/en_us/article/ne74nw/inside-the-world-of-the-bitcoin-carnivores">bitcoin maximalist carnivores</a>, and n00bs asking “<a href="https://www.quora.com/What-does-going-to-the-moon-mean-in-cryptocurrency">wen moon</a>” in Telegram channels that their actions would create. In 11 years, crypto has gone from something completely esoteric to something seemingly everyone has heard about.</p><p>2019 has been a big year for crypto so far. Some of the highlights include the <a href="https://www.wsj.com/articles/sec-clears-blockstack-to-hold-first-regulated-token-offering-11562794848">SEC approving its first token sale</a>, the <a href="https://www.creditkarma.com/insights/i/irs-crack-down-cryptocurrency-owners/">IRS tracking down crypto tax-evaders</a>, <a href="https://www.thetradenewscrypto.com/vast-majority-endowment-funds-testing-crypto-investments/">university endowments investing in crypto</a>, and even <a href="https://techcrunch.com/2019/06/18/facebook-libra/">Facebook announcing its own cryptocurrency</a>. We also just this month saw <a href="https://news.ycombinator.com/item?id=20919958">over $1 billion in Bitcoin transferred in a single transaction</a>. 
All this, and more, indicates a revived interest in the crypto markets since the highs of 2017 and lows of 2018, by everyone from institutional investors and banks to lay people trying to side hustle.</p><p>With the crypto markets once again awash with speculation and hype, it’s important to leverage all the tools at our disposal in order to make sense of the noise. Sometimes, reading articles and email newsletters isn’t enough. You have to go directly to the data.</p><p>As the developers of <a href="https://www.timescale.com/">TimescaleDB</a>, an open-source time-series database powered by PostgreSQL, we’re data-driven people. So, we thought it would be interesting to take a <em>data-driven approach</em> to analyzing the crypto market. For this analysis, we used PostgreSQL and TimescaleDB to analyze market data about Bitcoin, Ethereum, and 4,196 other cryptocurrencies, and used Tableau to visualize our results.</p><p><strong>This post shares many high-level insights about the crypto market since its inception and during recent years. 
We answer questions like:</strong><br></p><ul><li>How has the price of Bitcoin and Ethereum changed in the past several years?</li><li>Which new cryptocurrencies have been the most profitable in the past 3 months?</li><li>What are the cryptocurrencies on the rise?</li><li>What was the best day to “day-trade” Bitcoin?</li><li>What countries have the highest trading volume of BTC today?</li><li>Why is Bitcoin a terrible way to pay for pizza?</li></ul><p>...and many more, as we dive into analysis of topics like Bitcoin and Ethereum price, new coin growth, trading volume and daily returns.</p><p>We also share how powerful SQL is as a query language for analyzing time-series data, how TimescaleDB and PostgreSQL further simplify time-series data analysis, and how using the two with Tableau visualizations can surface interesting insights from your data.</p><p>For the technically curious, you can learn how to create the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset</a> we used for this analysis, load it, and draw insights from it, in this <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">companion tutorial post</a>. 
In the tutorial, you will find <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">step by step instructions</a> on how to <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">create the dataset using Python</a> (including all code we used for the analysis), how to <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">load the data</a> into <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>, a cloud-hosted version of TimescaleDB, and how to connect your database in the cloud to Tableau to <a href="https://timescale.ghost.io/blog/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">recreate the analysis and produce graphs</a>.</p><h2 id="about-the-data-used-for-this-analysis">About the data used for this analysis</h2><p>For this analysis, we used historical <a href="https://en.wikipedia.org/wiki/Open-high-low-close_chart">OHLCV</a> price data for over <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">4100 cryptocurrencies</a> from 7/17/2010 to 9/16/2019, courtesy of <a href="https://www.cryptocompare.com/">CryptoCompare</a> and their <a href="https://min-api.cryptocompare.com/">wonderful API</a>. 
While the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset we used</a> only includes daily data, TimescaleDB <a href="https://timescale.ghost.io/blog/recap-performant-time-series-data-management-analytics-with-postgres/">easily scales to handle data from much finer grained time periods</a>.</p><p>Some of you may recall that we did a <a href="https://timescale.ghost.io/blog/analyzing-ethereum-bitcoin-and-1200-cryptocurrencies-using-postgresql-3958b3662e51/">similar analysis on crypto back in 2017</a>, but so much has happened since then, including the addition of almost 3000 more cryptocurrencies and Bitcoin hitting nearly ~$20k in price, that we had to revisit this topic. Where applicable, we’ve included graphs to focus on recent history, from 2017-2019, updating the analysis from our previous post.</p><p><em>DISCLAIMER: At Timescale, we help companies harness the power of time series data to make sense of the past, monitor the present, and predict the future. However, nothing in this analysis should be construed as financial advice and we take no liability for your actions as a result of using the information contained in this post. You’re welcome to draw your own conclusions using the tools and data and take your own risks accordingly.</em></p><h2 id="so-if-you%E2%80%99d-invested-100-in-bitcoin-9-years-ago-today-it%E2%80%99d-be-worth%E2%80%A6">So if you’d invested $100 in Bitcoin 9 years ago, today it’d be worth…</h2><p>When analyzing cryptocurrencies, we have to start with the original: Bitcoin. For any beginners, Bitcoin <a href="https://www.luno.com/learn/en/article/bitcoin-as-digital-gold">can be thought of as digital gold</a>, because Bitcoin has built-in scarcity (only 21 million BTC will ever be produced), can be almost infinitely divided without losing its unit value, and is difficult to counterfeit. 
(As an aside, for those looking for an introduction to Bitcoin and other cryptocurrencies, two good places to start are <a href="https://www.lopp.net/pdf/princeton_bitcoin_book.pdf">The Princeton Bitcoin Book</a> and <a href="https://www.amazon.com/Internet-Money-Andreas-M-Antonopoulos/dp/1537000454">The Internet of Money</a>.)</p><p>Looking at historical BTC-USD prices since 2010, we see that BTC prices have slowly increased, with an almost exponential increase taking place between 2014 and 2018. </p><pre><code class="language-SQL">--Query 1
-- BTC 7 day prices
SELECT time_bucket('7 days', time) as period,      
	last(closing_price, time) AS last_closing_price
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q1.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 1: Bitcoin Closing price in USD from 2010-2019</strong></b></i></figcaption></figure><p>So to answer our original question (and probably create some FOMO), if you bought $100 worth of Bitcoin on 16 September 2010, the price of 1 BTC was $0.0619, meaning $100 would have bought you 1615.5088853 BTC. Fast forward 9 years, and that $100 would have grown to $16,476,736.67! Cue the lambos! (But of course you would probably have sold when BTC hit $100 back on 29 July 2013 :)).</p><p><em>Generating insights like this from time-series data takes no more than a 5-line SQL query thanks to TimescaleDB’s </em><a href="https://docs.timescale.com/latest/api#time_bucket"><em>time_bucket</em></a><em> and </em><a href="https://docs.timescale.com/latest/api#last"><em>last</em></a><em> functions, which were created specifically for TimescaleDB to simplify time-series analysis.</em></p><h2 id="why-bitcoin-is-a-terrible-way-to-pay-for-pizza">Why Bitcoin is a terrible way to pay for pizza</h2><p>One thing that’s evident from the data is that Bitcoin is volatile. Bitcoin’s price volatility may mean that its main use case is being a store of value rather than a means of exchange. 
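</p><p><em>As an aside, one way to put a number on this volatility (a sketch, assuming the same btc_prices table used above, and not part of the original analysis) is to compare the standard deviation of daily closing prices to their average within 30-day buckets:</em></p><pre><code class="language-SQL">-- Sketch: 30-day BTC price volatility (assumes the btc_prices table above)
SELECT time_bucket('30 days', time) AS period,
	stddev(closing_price) AS price_stddev,
	avg(closing_price) AS price_avg
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period;</code></pre><p>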
The <a href="https://finance.yahoo.com/news/bitcoin-pizza-day-celebrating-world-100035402.html">Bitcoin Pizza Guy</a>, who paid 10,000 BTC for 2 pizzas back in 2010, would probably agree, as those 10,000 BTC are worth over $100,000,000 today!</p><p>Moreover, the time period between 2017 and 2019 saw a lot of ups and downs, so let’s zoom in on that period below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q1-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 2: Bitcoin Closing price in USD from July 2017 to July 2019</strong></b></i></figcaption></figure><p>As Figure 2 shows, 2017-2019 was arguably the most exciting, and perhaps also the most painful, time in the Bitcoin market, with Bitcoin prices reaching a high of nearly $20,000 before crashing to under $7,000 three months later. That decline continued throughout 2018, with some calling it the first crypto bear market and others a “Crypto Winter”. However, Bitcoin’s recent price run since April 2019 may be the first indication of the end of the Crypto Winter, with BTC prices reaching over $12,000 in June 2019.</p><h2 id="the-best-day-to-%E2%80%9Cday-trade%E2%80%9D-bitcoin-february-26-2014">The best day to “day-trade” Bitcoin? February 26, 2014.</h2><p>In order to better understand Bitcoin’s volatility, let’s look at the daily return on BTC, as a factor of the previous day’s rate. This is simple to do using <a href="https://www.postgresql.org/docs/9.3/functions-window.html">PostgreSQL window functions</a>, as shown in the query below.</p><pre><code class="language-SQL">--Query 2
-- BTC daily return
SELECT time,
	closing_price / lead(closing_price) over prices AS daily_factor
FROM (
	SELECT time,
		closing_price
	FROM btc_prices
	WHERE currency_code = 'USD'
	GROUP BY 1, 2
) sub window prices AS (ORDER BY time DESC)</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q2.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 3.1: BTC Daily Return from 2010-2019</strong></b></i></figcaption></figure><p>From Figure 3.1, we can see large amounts of volatility in BTC price from 2010, culminating in a huge spike in daily return factor in February 2014. In fact, within just 7 days we saw the day with the lowest daily return factor — 0.428 on 20 February 2014 — and the highest ever daily return factor — 4.368 on 26 February 2014! It’s no wonder that <a href="https://www.washingtonpost.com/business/why-facebook-chose-stablecoins-as-its-path-to-crypto/2019/06/18/2fa7d738-91e7-11e9-956a-88c291ab5c38_story.html?noredirect=on">stablecoins</a>, or price-stable cryptocurrencies are being looked into as alternatives to use as a means of exchange within apps or for everyday transactions. Imagine paying for everything you buy in Bitcoin and the headaches that daily and weekly BTC price volatility causes.</p><p>Once again, let’s zoom in on the 2017-2019 time period:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q2-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 3.2: BTC Daily Return from 2017-2019</strong></b></i></figcaption></figure><p>From Figure 3.2, we see that Bitcoin’s most profitable day since the start of 2017 occurred recently on 25 August 2019, with a daily return of 1.99 times the previous day’s rate. 
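</p><p><em>A sketch of how the specific dates above can be found (this query isn’t shown in the original post): wrap Query 2 in a subquery and sort by the daily return factor; sorting ascending instead would surface the biggest losses.</em></p><pre><code class="language-SQL">-- Sketch: BTC's most profitable days since 2017, built on Query 2
SELECT time, daily_factor
FROM (
	SELECT time,
		closing_price / lead(closing_price) OVER (ORDER BY time DESC) AS daily_factor
	FROM btc_prices
	WHERE currency_code = 'USD'
) sub
WHERE time &gt;= '2017-01-01'
ORDER BY daily_factor DESC NULLS LAST
LIMIT 5;</code></pre><p>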
The day with the biggest loss was 16 January 2018, with a daily return factor of 0.8276, and 14 September 2017 came in second, with a daily return factor of 0.8379.</p><h2 id="bitcoin%E2%80%99s-top-countries-by-trading-volume-us-japan-south-korea-and-poland">Bitcoin’s top countries by trading volume: US, Japan, South Korea, and... Poland!?</h2><p>Crypto fever has taken the world by storm, with lots of adoption taking place outside of the USA, most notably in Europe and Asia. We can get a sense of how crypto is being adopted in different regions of the world by looking at Bitcoin trading volume in different fiat currencies over time.</p><pre><code class="language-SQL">--Query 3
-- BTC trading volumes by currency 
SELECT time_bucket('14 days', time) as period,
	currency_code,       
    	sum(volume_btc)
FROM btc_prices
GROUP BY currency_code, period
ORDER BY period;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 4: BTC trading volume in different fiat currencies</strong></b></i></figcaption></figure><p>From Figure 4 above, we can see that China saw huge amounts of bitcoin trading volume before government intervention made buying Bitcoin illegal in mid-2017. We can more clearly see how drastic this effect was by looking at Figure 5 below, which is a bar graph version of Figure 4, showing the volume of BTC traded in different fiat currencies by year:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3--2-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 5: BTC Trading Volume in different fiat currencies by year</strong></b></i></figcaption></figure><p>From Figure 5, we can more clearly see the rise in Chinese (CNY) Bitcoin trading activity and how government intervention in 2017 brought that to a halt. Moreover, we can see Japan (JPY) and South Korea (KRW) overtaking Europe with respect to Bitcoin trading volume, with more volume than the Euro (EUR) during 2017 and 2018. 
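</p><p><em>The yearly totals behind Figure 5 can be produced with a small variant of Query 3; this sketch (not shown in the original post) buckets by calendar year instead of 14-day periods:</em></p><pre><code class="language-SQL">-- Sketch: BTC trading volume per fiat currency by calendar year
SELECT date_trunc('year', time) AS year,
	currency_code,
	sum(volume_btc) AS yearly_volume_btc
FROM btc_prices
GROUP BY year, currency_code
ORDER BY year, yearly_volume_btc DESC;</code></pre><p>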
This confirms the <a href="https://www.bbc.com/news/business-42713314">USA, Japan and South Korea as the world’s 3 largest bitcoin markets</a>.</p><p>Furthermore, if we remove USD, CNY, JPY, KRW and EUR from the list of fiat currencies, we can get a sense of the trend in Bitcoin adoption outside the largest markets, as shown in Figure 6 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3--5-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 6: BTC Trading Volume in different fiat currencies by year (excluding USD, JPY, KRW, CNY, EUR)</strong></b></i></figcaption></figure><p>Note how South Africa (ZAR) has seen rising BTC trade volumes since 2015, the dramatic increase in trading volume from Hong Kong (HKD) in 2017 and subsequent decrease, and that the currency with the highest trade volume outside the big 5 is none other than the Polish Zloty (PLN)!</p><p>Figure 6.1 shows trading volumes of BTC in PLN since 2014:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3--6-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 6.1: BTC trading volumes in PLN</strong></b></i></figcaption></figure><h2 id="now-if-you%E2%80%99d-bought-100-worth-of-ethereum-in-2015-today-it%E2%80%99d-be-worth">Now if you’d bought $100 worth of Ethereum in 2015, today it’d be worth...</h2><p>Ethereum is popularly regarded as the cryptocurrency with the second-largest interest base after Bitcoin; however, it is fundamentally different. 
While Bitcoin is considered to be digital gold, Ether is more like fuel (<a href="https://ethereum.stackexchange.com/questions/3/what-is-meant-by-the-term-gas">gas</a>) that runs transactions on the Ethereum network.</p><p>Since the <a href="https://cointelegraph.com/news/ethereum-raises-3700-btc-in-first-12-hours-of-ether-presale">first currency with which you could buy Ethereum was Bitcoin</a>, let’s take a look at historical ETH prices in BTC, shown in Figure 7 below:</p><pre><code class="language-SQL">-- Q4
-- ETH prices in BTC in 7 day intervals
SELECT  
	time_bucket('7 days', time) AS time_period,   
    last(closing_price, time) AS closing_price_btc
FROM crypto_prices
WHERE currency_code='ETH'
GROUP BY time_period
ORDER BY time_period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q4.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 7: ETH Price in BTC since 2015</strong></b></i></figcaption></figure><p>Figure 7 shows us Ethereum (ETH) closing prices since 3 August 2015 in weekly intervals, expressed in BTC. Notice that 2017 was a rollercoaster year for ETH, with the currency seeing its all-time high of 0.1402 BTC on 12 June 2017 and then crashing back down to 0.0288 BTC on 4 December 2017, less than 6 months later. </p><p>Let’s take a look at recent Ethereum prices by zooming in on the period since 2017 in Figure 8 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q4-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 8: ETH Price in BTC since 1 January 2017</strong></b></i></figcaption></figure><p>From Figure 8 above, it seems that ETH then went on another bull run in early 2018, with prices reaching 0.1052 BTC in the period around 22 January 2018. Since then, it seems like ETH prices have been trending downward, with the price on 16 September 2019 reaching 0.0188 BTC. While that’s not great for investors, it may prove to be a blessing for developers and decentralized application users in Ethereum’s ecosystem, as gas costs would be cheaper, perhaps decreasing the barriers to adoption.</p><h2 id="crypto-convertibles-not-the-car-kind">Crypto Convertibles (not the car kind)</h2><p>Since most people don’t think about prices in BTC (yet), and given how volatile BTC is, it’s useful to also examine ETH prices expressed in different fiat currencies. 
There are two ways this could be done: first, by looking at ETH prices directly in different fiat currencies, as we did for Bitcoin and USD in Figure 1 above; second, by converting ETH prices in BTC to fiat currency prices, using that day’s BTC-to-fiat exchange rate. For illustration purposes, we’ll use the second technique. While this may seem like a strange choice, it’s worth noting that conversions from one cryptocurrency to another and then to a fiat currency are fairly common in the cryptocurrency trading world. This is because many exchanges support buying cryptocurrencies with other cryptocurrencies (mainly BTC and ETH), but not all cryptocurrencies are purchasable directly with fiat currency.</p><p>In order to examine ETH prices in different fiat currencies in PostgreSQL, we joined two tables and used filters, as the code below illustrates:</p><pre><code class="language-SQL">--Q5
-- ETH prices in fiat
SELECT time_bucket('7 days', c.time) AS time_period,  
	last(c.closing_price, c.time) AS last_closing_price_in_btc,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'USD') AS last_closing_price_in_usd, 
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'EUR') AS last_closing_price_in_eur,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'CNY') AS last_closing_price_in_cny,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'JPY') AS last_closing_price_in_jpy,  
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'KRW') AS last_closing_price_in_krw
FROM crypto_prices c
JOIN btc_prices b   
	ON time_bucket('1 day', c.time) = time_bucket('1 day', b.time)
WHERE c.currency_code = 'ETH'
GROUP BY time_period
ORDER BY time_period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-USD-EUR.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.1: ETH Price in USD and EUR</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-JPY.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.2: ETH Price in JPY</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-CNY.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.3: ETH Price in CNY</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-KRW.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.4: ETH Price in KRW</strong></b></i></figcaption></figure><p>One thing to notice is that Figures 9.1-9.4 all have the same shape, since they are another expression of the ETH price in BTC. The main difference in the graphs is the scale of the Y axis, as this reflects the respective currency’s BTC exchange rate. This is a result of the decision to use the BTC-fiat exchange rate for this conversion, rather than direct ETH-fiat prices. 
When plotted all on the same axis, we get Figure 10 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 10: ETH Price in Different Fiat Currencies</strong></b></i></figcaption></figure><p>Fortunately, you can now directly purchase ETH using fiat currency on many exchanges. So let’s look at historical ETH prices in USD, in Figure 11 below:</p><pre><code class="language-SQL">-- ETH prices in USD
SELECT time_bucket('7 days', time) as period,       
	last(closing_price, time) AS last_closing_price
FROM eth_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q10-ETH-USD.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 11: ETH Price in USD from 2015-2019</strong></b></i></figcaption></figure><p>Figure 11 above tells a similar story to that of Figure 8, as we can see the ETH bull run to $1,359 on 8 January 2018. ETH prices in USD have been trending downward since then, with the price on 16 September 2019 falling to $199.</p><p>So to answer our original question, if you bought $100 worth of Ethereum on 16 September 2015, the price of 1 ETH was around $0.89, meaning $100 would have bought you 112.3596 ETH. That $100 would now have grown to $22,359.55! However, the best time to sell would’ve been during the January 2018 peak, when your 112.3596 ETH would’ve been worth $152,696.63! This just goes to show how important it is to time the market!</p><p><em>(One piece of analysis which we didn’t do, but encourage readers to do, is to examine the developer activity in the Ethereum ecosystem (e.g., GitHub commits/issues) and see if it correlates to price in some way.)</em></p><h2 id="tracking-4000-other-cryptocurrencies-starting-from-inception">Tracking 4000+ other cryptocurrencies, starting from inception</h2><p>While we can’t see when exactly coins ICO’d or first got listed on exchanges, we <em>can</em> track the date a coin first got added to CryptoCompare as a proxy for its launch date. This allows us to track the launch of different coins over time, as seen in Figure 11.</p><pre><code class="language-SQL">--Q6
--Crypto by date of first data
SELECT ci.currency_code, min(c.time)
FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
AND c.closing_price &gt; 0
GROUP BY ci.currency_code
ORDER BY min(c.time) DESC</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q6--Bar-Graph-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 11: Number of new cryptocurrencies launched per year</strong></b></i></figcaption></figure><p>It’s easy to conclude that a bull run in BTC prices might have fueled a massive increase in developer activity in the crypto space. The evidence for this comes from the 738 new cryptos released in 2017, a year in which BTC almost 10x’ed in price between January (1 BTC = $2,435 on Jan 1) and December (1 BTC = $19,345 on Dec 16). With bitcoin prices steadily increasing, it’s no wonder that hundreds of developers tried their hand at creating their own cryptocurrency in the hopes of maybe building the next Bitcoin.</p><p>However, despite the dramatic crash in BTC prices throughout most of 2018, it’s interesting to see the number of new cryptocurrencies launched in 2018 being the highest ever, with 771 new cryptocurrency projects launching that year.</p><p>Figure 12 shows us the result of examining the number of new cryptos launched by day.</p><pre><code class="language-SQL">--Q7
-- Number of new cryptocurrencies by day
SELECT day, COUNT(code)
FROM (
	SELECT min(c.time) AS day, ci.currency_code AS code
	FROM currency_info ci
	JOIN crypto_prices c ON ci.currency_code = c.currency_code
		AND c.closing_price &gt; 0
	GROUP BY ci.currency_code
	ORDER BY min(c.time)
) a
GROUP BY day
ORDER BY day DESC</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q7.png" class="kg-image" alt="" loading="lazy" width="1600" height="1092" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Q7.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Q7.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Q7.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 12: Number of new cryptocurrencies launched by day</strong></b></i></figcaption></figure><p>Two days in particular are super interesting. On 2 December 2014, we saw data for 81 cryptocurrencies being added to CryptoCompare, and on 26 May 2017, we saw a whopping 134 new cryptos added in a single day!</p><h2 id="these-coins-had-higher-transaction-volumes-than-bitcoin-this-month-hint-it%E2%80%99s-not-bitcoin-cash">These coins had higher transaction volumes than Bitcoin this month (Hint: It’s not Bitcoin Cash)</h2><p>With over 4,000 cryptocurrencies out there and new ones coming out every day, it can be hard to pick which ones are worth paying attention to. One helpful metric for spotting coins on the rise is the transaction volume. In Figure 13, we looked at the transaction volume for all 4054 coins in our dataset over the 14-day period from September 2 to 16, 2019.</p><pre><code class="language-SQL">--Q8
--Crypto transaction volume during a certain time period
SELECT 'BTC' as currency_code,      
	sum(b.volume_currency) as total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
GROUP BY b.currency_code
UNION
SELECT c.currency_code as currency_code,      
	sum(c.volume_btc) * avg(b.closing_price) as total_volume_in_usd
FROM crypto_prices c JOIN btc_prices b ON date(c.time) = date(b.time)
WHERE c.volume_btc &gt; 0
AND b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
AND now() - date(c.time) &lt; INTERVAL '14 day'
GROUP BY c.currency_code
ORDER BY total_volume_in_usd DESC </code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q8.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 13: Cryptos with the most USD transaction volume from Sept 2-16, 2019</strong></b></i></figcaption></figure><p>It’s surprising to see that Bitcoin, despite being the original cryptocurrency, did not have the largest transaction volume over the time period in question. That honor belonged to <a href="https://en.wikipedia.org/wiki/Tether_(cryptocurrency)">USD Tether</a> (USDT), a fiat-backed stablecoin, with Ethereum coming in second and <a href="https://litecoin.org/">Litecoin</a>, the proverbial silver to Bitcoin’s gold, coming in third. Bitcoin (BTC) had the fourth-highest USD transaction volume in that 14-day period, followed by <a href="https://www.ripple.com/">Ripple</a> (XRP), a global payments system which has partnered with several banks and payment processors, and <a href="https://en.wikipedia.org/wiki/EOS.IO">EOS</a>, a smart contract platform.</p><h2 id="what-are-the-most-profitable-new-cryptocurrencies">What are the most profitable new cryptocurrencies?</h2><p>Another way of making sense of the flood of new currencies is to look at how profitable coins are, as measured by total daily return. By homing in on the currencies with the highest increase in rate by day, we can gain a different perspective on which currencies might be worth looking into further.</p><p>One question to ask is which cryptocurrency has the highest daily return during a certain time period. Figures 14.1 and 14.2 below show cryptocurrencies sorted by their maximum daily return factor between June 16 and September 16, 2019.<br></p><pre><code class="language-SQL">--Q9
--Top crypto by daily return
WITH   
	prev_day_closing AS (
SELECT   
	currency_code,   
    time,   
    closing_price,   
    LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY TIME DESC) AS prev_day_closing_price
FROM    
	crypto_prices  
),   
	daily_factor AS (
SELECT   
	currency_code,   
	time,   
	CASE WHEN prev_day_closing_price = 0 THEN 0 ELSE closing_price/prev_day_closing_price END AS daily_factor
FROM   
	prev_day_closing
)
SELECT   
	time,   
	LAST(currency_code, daily_factor) as currency_code,   
    MAX(daily_factor) as max_daily_factor
FROM   
	daily_factor
GROUP BY   
time</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q9--2-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 14.1: Cryptocurrencies sorted by absolute highest daily return</strong></b></i></figcaption></figure><p>From Figure 14.1 above, we see that <a href="https://www.cryptocompare.com/coins/mixi/overview">Mixin (MIXI)</a>, an Ethereum-based token, comes out on top with a maximum daily return factor of over 25 million times the previous day’s rate. <a href="https://www.cryptocompare.com/coins/bomb/overview">BOMB</a>, an experimental deflationary currency, had the second-highest absolute daily return with a return factor of over 700,000. We can get a better look at the rest of the data by removing these two outliers, as seen in Figure 14.2 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q9--3-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 14.2: Cryptos sorted by absolute highest daily return (with MIXI, BOMB removed)</strong></b></i></figcaption></figure><p>From Figure 14.2, we see that other cryptos with big days in the past 3 months were <a href="https://www.cryptocompare.com/coins/xde2/overview">Double Eagle Coin (XDE2)</a>, <a href="https://www.cryptocompare.com/coins/xgr/overview">Gold Reserve (XGR)</a>, <a href="https://www.cryptocompare.com/coins/soul/overview">SoulCoin (SOUL)</a> and <a href="https://www.cryptocompare.com/coins/ccc/overview">CCCoin (CCC)</a>. 
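</p><p><em>The frequency view in Figure 15 further below (the number of days each coin had the top daily return) can be sketched by extending Query 9’s CTEs; this query is not shown in the original post:</em></p><pre><code class="language-SQL">-- Sketch: days with the top daily return per coin, building on Query 9
WITH prev_day_closing AS (
	SELECT currency_code, time, closing_price,
		LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY time DESC) AS prev_day_closing_price
	FROM crypto_prices
),
daily_factor AS (
	SELECT currency_code, time,
		CASE WHEN prev_day_closing_price = 0 THEN 0
			ELSE closing_price / prev_day_closing_price END AS daily_factor
	FROM prev_day_closing
),
top_per_day AS (
	SELECT time, LAST(currency_code, daily_factor) AS currency_code
	FROM daily_factor
	GROUP BY time
)
SELECT currency_code, COUNT(*) AS days_with_top_return
FROM top_per_day
WHERE time BETWEEN '2019-06-16' AND '2019-09-16'
GROUP BY currency_code
ORDER BY days_with_top_return DESC;</code></pre><p><em>Swapping last() for TimescaleDB’s first() in the top_per_day step would count each coin’s worst days instead.</em>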
</p><p>Furthermore, it’s interesting to note the difference in order of magnitude between the coins with the top daily return in the past 3 months, with MIXI achieving a daily return factor in the millions, BOMB in the hundreds of thousands, and the next-highest coins in the tens of thousands. </p><p>Another interesting thing to look at is the number of times a coin had the top daily return during a certain period of time. Figure 15 shows us the coins with the highest frequency of having the top daily return during the 3 months between 16 June and 16 September 2019.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q9.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 15: Top coins by most days with highest daily return during a 3 month period</strong></b></i></figcaption></figure><p>The coins with the highest frequency of having the top daily return are <a href="https://www.cryptocompare.com/coins/mixi/overview">MIXI</a> (discussed above), with 5 unique days with the top daily return; <a href="https://www.coinbase.com/price/bitether">Bitether (BTR)</a>, a cryptocurrency built on the Ethereum platform, with 3 unique days; and <a href="https://coinmarketcap.com/currencies/icechain/">IceChain (ICHX)</a>, coming in third with 2 unique days. </p><h2 id="conclusion">Conclusion</h2><p>In this post, we used the power of PostgreSQL and TimescaleDB to analyze a public cryptocurrency dataset of over 4100 cryptocurrencies over the period 2010 to 2019. 
We examined time-series trends in Bitcoin and Ethereum prices, new coin growth, trading volume, daily returns, and more.</p><p>While our analysis aimed to provide a taste of what’s possible using PostgreSQL and TimescaleDB, we encourage you to take the tools we used and apply them to different crypto datasets to gain deeper insights!</p><p>We’ve created a <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">companion tutorial post</a> for those interested in re-creating this analysis or looking for a starting point for their own analysis. In the tutorial, you will find <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">step-by-step instructions</a> on how to create the dataset using Python (including all code we used for the analysis), how to load the data into <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>, and how to connect your database in the cloud to Tableau to recreate the analysis and produce graphs. If you do perform your own analysis, let us know what interesting insights you find! </p><p>Moreover, you can dig into the technical side of Timescale and how we made PostgreSQL scalable for time-series data <a href="https://timescale.ghost.io/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">in this detailed post</a>. If you’re interested in experiencing the power of Timescale for your time-series data, sign up for <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>!</p><p>Please drop your comments below and share this post with others whom you think would enjoy it. 
For follow-up questions or comments, reach out to us on Twitter (<a href="https://twitter.com/timescaledb">@TimescaleDB</a> or <a href="https://twitter.com/avthars">@avthars</a>), our community <a href="http://slack.timescale.com/">Slack channel</a>, or reach out to me directly via email (avthar at timescale dot com).</p>]]></content:encoded>
        </item>
    </channel>
</rss>