<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 20:13:03 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[Building Multi-Tenant RAG Applications With PostgreSQL: Choosing the Right Approach]]></title>
            <description><![CDATA[Learn how to pick the right approach to handle multi-tenancy for your use case when building multi-tenant RAG applications with PostgreSQL.]]></description>
            <link>https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/building-multi-tenant-rag-applications-with-postgresql-choosing-the-right-approach</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[RAG]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 11 Oct 2024 15:51:00 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/10/Building-Multi-Tenant-RAG-Applications-With-PostgreSQL-Choosing-the-Right-Approach-1.webp">
            </media:content>
            <content:encoded><![CDATA[<p>If you’re building a retrieval-augmented generation (RAG) app with PostgreSQL and pgvector, you’ll probably run into the problem of handling multi-tenancy. This article explains how to pick the right approach to handle multi-tenancy for your use case.</p><h2 id="what-is-multi-tenancy">What Is Multi-Tenancy?</h2><p>Multi-tenancy is like an apartment building for software. Just as one building houses multiple tenants (families or individuals), a multi-tenant application serves multiple customers or organizations using a single instance of the software.</p><p>Multi-tenancy serves multiple "tenants" independently and securely—thereby preventing accidental or unauthorized cross-referencing of private information between different users. This means designing a system that not only understands and retrieves information effectively but also strictly adheres to user-specific data boundaries.&nbsp;</p><h2 id="multi-tenant-applications-pros-and-cons">Multi-Tenant Applications: Pros and Cons&nbsp;</h2><p>Multi-tenancy delivers several key benefits for RAG applications:</p><ul><li><strong>Data isolation:</strong> This allows multiple tenants to use the same RAG system while keeping their data separate and secure. This is crucial for maintaining privacy and confidentiality.</li><li><strong>Scalability: </strong>A multi-tenant architecture enables the system to efficiently serve many users or organizations without needing separate deployments for each, leading to better resource utilization and cost-effectiveness.</li><li><strong>Customization: </strong>Different tenants often have unique needs. Multi-tenancy allows for customization of the RAG system for each tenant, such as using tenant-specific knowledge bases or fine-tuning models for particular domains.</li><li><strong>Compliance:</strong> In many industries, regulatory requirements mandate strict data separation. 
Multi-tenancy helps meet these compliance needs by ensuring data from different tenants doesn't mix.</li><li><strong>Efficient updates: </strong>With a multi-tenant system, updates and improvements can be rolled out to all tenants simultaneously, ensuring everyone benefits from the latest features and security patches.</li><li><strong>Cost-effectiveness: </strong>Sharing infrastructure and resources across tenants can significantly reduce operational costs compared to maintaining separate systems for each user or organization.</li><li><strong>Consistent performance:</strong> A well-designed multi-tenant RAG system can provide more consistent performance across all tenants, as resources are dynamically allocated based on usage.</li></ul><p>To build for long-term performance, flexibility, and efficiency, it’s also helpful to keep in mind the potential challenges of multi-tenant architectures:</p><ul><li>A multi-tenant environment allows multiple access points for users, which can increase the threat of a security breach.</li><li>Serving multiple clients in one instance of an application or database adds an extra level of complexity to the codebase and database maintenance.</li><li>Backup and restoration are more complex, so not all providers offer reliable restoration services.</li><li>The ability to offer tenant-specific customizations is limited, and balancing the shared codebase with unique tenant requirements is often necessary.</li><li>A technical problem on the provider’s end may affect all tenants simultaneously. 
This may apply to uptime, system upgrades, and other global processes.</li></ul><h2 id="postgresql-benefits-for-multi-tenant-rag-apps">PostgreSQL Benefits for Multi-Tenant RAG Apps</h2><p>PostgreSQL, enhanced with the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a> (the popular open-source extension for vector handling in PostgreSQL), offers a robust solution for implementing multi-tenant RAG apps. Its ability to efficiently store and <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a> embeddings alongside traditional data types makes it an ideal choice for organizations looking to leverage their existing infrastructure.&nbsp;</p><p>Here are the reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why PostgreSQL</a> is a good fit for multi-tenant RAG applications:</p><ol><li><strong>Built-in full-text search:</strong> PostgreSQL has robust full-text search capabilities, which are crucial for efficient retrieval in RAG systems.</li><li><strong>JSON support:</strong> PostgreSQL handles JSON data natively, allowing flexible storage of documents and metadata.</li><li><strong>Vector extensions:</strong> extensions like pgvector enable vector similarity searches, essential for embedding-based retrieval.</li><li><strong>Row-level security: </strong>this feature allows fine-grained access control, crucial for multi-tenant setups to ensure data isolation.</li><li><strong>Scalability:</strong> PostgreSQL can handle large datasets and concurrent users, important for growing multi-tenant applications.</li><li><strong>ACID compliance:</strong> <a href="https://www.timescale.com/learn/understanding-acid-compliance" rel="noreferrer">atomicity, consistency, isolation, and durability (ACID) compliance</a> ensures data integrity and consistency across transactions.</li><li><strong>Extensibility: </strong>custom functions 
and extensions can be added to tailor the database to specific RAG needs.</li><li><strong>Cost-effective:</strong> as an open-source solution, it can be more cost-effective than some cloud-based alternatives.</li></ol><p>Using PostgreSQL for multi-tenant RAG applications also gives you the advantage of <a href="https://cerebralvalley.ai/blog/timescale-is-making-postgresql-better-for-ai-1tiUqSzGsSn76ORZVfMwOk"><u>Timescale Cloud’s stack of open-source extensions</u></a> to easily build and scale RAG, search, and agent applications. In addition to pgvector, this stack includes <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> (which builds on pgvector for enhanced performance and scale) and <a href="https://www.timescale.com/ai"><u>pgai</u></a> (which brings embedding creation and large language model completions to the database, giving more <a href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>PostgreSQL developers the skills of AI engineers</u></a>). Both extensions complement pgvector and rely on its capabilities.</p><h3 id="implementing-rag-with-postgresql">Implementing RAG with PostgreSQL</h3><p>Implementing RAG with PostgreSQL involves a multi-step process that leverages the database's vector storage capabilities. The workflow typically includes ingesting and chunking data, converting text into <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> using an embedding model, and storing these vectors in PostgreSQL using pgvector.&nbsp;</p><p>When a user query is received, the system retrieves the most relevant data from the vector database based on similarity search. This retrieved information is then combined with the user's question and any additional context to create a comprehensive prompt for the large language model (LLM). 
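</p><p>Because retrieval runs in SQL, tenant isolation can be enforced at exactly this step. Here is a minimal, illustrative sketch using row-level security; the <code>documents</code> table, the <code>app.tenant_id</code> setting, and the embedding dimension are hypothetical names chosen for the example:</p><pre><code class="language-postgresql">-- hypothetical schema: documents(id, tenant_id, content, embedding vector(1536))
alter table documents enable row level security;

-- each session identifies its tenant, e.g.:
--   select set_config('app.tenant_id', '42', false);
create policy tenant_isolation on documents
    using (tenant_id = current_setting('app.tenant_id', true)::bigint);

-- with the policy active, similarity search only sees the current tenant's rows
select content
from documents
order by embedding &lt;=&gt; '[0.1, 0.2, ...]'::vector  -- placeholder query embedding
limit 5;
</code></pre><p>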
The LLM processes this enriched prompt and generates a response, which is then returned to the user, providing a more accurate and contextually relevant answer.</p><h2 id="strategies-for-dealing-with-multi-tenancy-for-rag-in-postgresql">Strategies for Dealing With Multi-Tenancy for RAG in PostgreSQL</h2><p>To pick the right strategy for your multi-tenant RAG application with PostgreSQL, consider your requirements (and your users’ or customers’ requirements) for shared resources, data separation, customization, scalability, and of course, costs.&nbsp;</p><p>PostgreSQL offers four levels of multi-tenancy implementation—table, schema, logical database, and database service—each suitable for distinct use case scenarios and each with its pros and cons. Here’s a comparative overview of each level.&nbsp;&nbsp;</p><ol><li><strong>Table-level multi-tenancy</strong> gives each tenant its own table. This is simple but may lead to data isolation concerns. Table-level isolation works well for simple apps with shared data.</li><li><strong>Schema-level separation </strong>gives each tenant its own schema in the same logical database. It offers better isolation with minimal operational and cost overhead, balancing isolation and efficiency for most use cases.</li><li><strong>Logical database separation</strong> gives each tenant its own logical database in a database instance. Separation at the logical database level provides stronger separation but increases complexity. Logical databases suit clients needing stricter separation.</li><li><strong>Database service separation </strong>gives each tenant their own database service. Service-level separation is ideal for high-security scenarios or clients with unique needs. 
It offers the highest isolation but at the cost of increased resources and management overhead.</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg" class="kg-image" alt="&nbsp;Overview of approaches for handling multi-tenancy in PostgreSQL" loading="lazy" width="1654" height="931" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1600w, https://timescale.ghost.io/blog/content/images/2024/10/strategies-for-multitenant-rag-in-postgres-pgvector.jpg 1654w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">&nbsp;Overview of approaches for handling multi-tenancy in PostgreSQL</span></figcaption></figure><h2 id="conclusion">Conclusion</h2><p>By carefully considering the optimal use cases, pros, and cons of each multi-tenancy approach and aligning them with your application's needs, you can create a scalable, secure, and performant RAG system in PostgreSQL. As RAG technologies continue to evolve, PostgreSQL's extensibility and strong community support ensure that it will remain an adaptable platform for building sophisticated multi-tenant AI applications.&nbsp;</p><p>Additionally, PostgreSQL on Timescale Cloud allows you to store your relational, <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, events, semi-structured, <em>and</em> vector data in one place. This removes the operational complexity of managing a separate vector database. 
It can deliver <a href="https://timescale.ghost.io/blog/pgvector-vs-pinecone/"><u>performance, rich capabilities, and user experience</u></a> equal to or better than a specialized tool. <br><br><a href="https://console.cloud.timescale.com/signup" rel="noreferrer">Create a free account to try Timescale Cloud's open-source AI stack today</a> (including pgvector, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a>, and pgvectorscale).</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai]]></title>
            <description><![CDATA[Enterprise-ready LLMs from Cohere, now available in the pgai PostgreSQL extension.]]></description>
            <link>https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai</guid>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Fri, 09 Aug 2024 18:01:55 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/08/Cohere_pgai.png">
            </media:content>
            <content:encoded><![CDATA[<p><em>Enterprise-ready LLMs from Cohere—now available in the pgai PostgreSQL extension.</em></p><p><a href="https://cohere.com" rel="noreferrer"><strong>Cohere</strong></a> is a leading generative AI company in the field of large language models (LLMs) and retrieval-augmented generation (RAG) systems. What sets Cohere apart from other model developers is its unwavering focus on enterprise needs and support for multiple languages.&nbsp;</p><h3 id="cohere-models">Cohere models</h3><p>Cohere has quickly gained traction among businesses and developers alike, thanks to its suite of models that cater to key steps of the AI application-building process: text embedding, result reranking, and reasoning. Here’s an overview of Cohere models:</p><ul><li><a href="https://cohere.com/embed"><strong><u>Cohere Embed</u></strong></a><strong>:</strong> A leading text representation model supporting over 100 languages, with the latest version (Embed v3) capable of evaluating document quality and relevance to queries. Cohere Embed is ideal for <a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer">semantic search</a>, retrieval-augmented generation (RAG), clustering, and classification tasks.</li><li><a href="https://cohere.com/rerank"><strong><u>Cohere Rerank</u></strong></a><strong>:</strong> A model that significantly improves search quality for any keyword or vector search system with minimal code changes. It’s optimized for high throughput and reduced compute requirements while leveraging Cohere's embedding performance for accurate reranking. Cohere rerank is system agnostic, meaning it can be used with any vector search system, including PostgreSQL.</li><li><a href="https://cohere.com/command"><strong><u>Cohere Command</u></strong></a>: A family of highly scalable language models optimized for enterprise use. Command supports RAG, multi-language use, tool use, and citations. 
Command R+ is the most advanced model, as it’s optimized for conversational interaction and long-context tasks. Command R is well suited for simpler RAG and single-step tool use tasks and is the most cost-effective choice out of the Command family.&nbsp;</li></ul><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🎉</div><div class="kg-callout-text">Today, we’re thrilled to announce that PostgreSQL developers can now harness the power of Cohere's enterprise-grade language models directly on PostgreSQL data using the pgai PostgreSQL extension.&nbsp;</div></div><h3 id="what-is-pgai">What is pgai?</h3><p><a href="https://github.com/timescale/pgai"><u>Pgai</u></a> is an open-source <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> that brings AI models closer to your data, simplifying tasks such as embedding creation, text classification, semantic search, and retrieval-augmented generation on data stored in PostgreSQL.</p><p><strong>Pgai supports the entire suite of Cohere Embed, Rerank, and Command models.</strong> This means that developers can now:</p><ul><li>Create embeddings using Cohere’s Embed model for data inside PostgreSQL tables, all without having to pipe data out of and back into the database.</li><li>Build hybrid search systems for higher quality results in search and RAG applications using Cohere Rerank, combining vector search in pgvector and PostgreSQL full-text search.</li><li>Perform tasks like classification, summarization, <a href="https://www.tigerdata.com/blog/automating-data-enrichment-in-postgresql-with-openai" rel="noreferrer">data enrichment</a>, and other reasoning tasks on data in PostgreSQL tables using Cohere Command models.</li><li>Build highly accurate RAG systems completely in SQL, leveraging the Cohere Embed, Rerank, and Command models together.</li></ul><p>We built pgai to <a 
href="https://timescale.ghost.io/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers/"><u>give more PostgreSQL developers AI Engineering superpowers</u></a>. Pgai makes it easier for database developers familiar with PostgreSQL to become "AI engineers" by providing familiar SQL-based interfaces to AI functionality like embedding creation and model usage.</p><p>While Command is the flagship model for building scalable, production-ready AI applications, <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> supports a variety of models in the Cohere lineup. This includes specialized embedding models that support over 100 languages, enabling developers to select the most suitable model for their specific enterprise use case. </p><p>Whether it's building internal AI knowledge assistants, security documentation AI, customer feedback analysis, or employee support systems, Cohere's models integrated with pgai and PostgreSQL offer powerful solutions.</p><p><strong>Getting started with Cohere models in pgai</strong></p><p>Ready to start building with Cohere's suite of powerful language models and PostgreSQL? Pgai is open source under the PostgreSQL License and is available for immediate use in your AI projects. You can find installation instructions on the pgai <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>GitHub repository</u></a>. You can also access pgai (alongside pgvector and <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a>) on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s Cloud PostgreSQL platform</u></a>. 
If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.</p><p>Once connected to your database, create the pgai extension by running:</p><pre><code class="language-postgresql">CREATE EXTENSION IF NOT EXISTS ai CASCADE;</code></pre><p><strong>Join the Postgres AI Community</strong></p><p>Have questions about using Cohere models with pgai? Join the <a href="https://discord.gg/KRdHVXAmkp?ref=timescale.com"><u>Postgres for AI Discord</u></a>,&nbsp; where you can share your projects, seek help, and collaborate with a community of peers.&nbsp; You can also <a href="https://github.com/timescale/pgai/issues?ref=timescale.com"><u>open an issue on the pgai GitHub</u></a> (and while you’re there, stars are always appreciated ⭐).</p><p>Next, let's explore the benefits of using Cohere models for building enterprise AI applications. And we’ll close with a real-world example of using Cohere models to build a hybrid search system, combining pgvector semantic search with PostgreSQL full-text search.&nbsp;</p><p>Let's dive in...</p><h2 id="why-use-cohere-models-for-rag-and-search-applications">Why Use Cohere Models for RAG and Search Applications?</h2><p>Cohere's suite of models offers several advantages for enterprise-grade RAG and search applications:</p><p><strong>High-quality results</strong>: Cohere models excel in understanding business language, providing relevant content, and tackling complex enterprise challenges.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcxGJoF6tCFIRf-wLKvkqUKCRp7S7WekDS6qQuySntrZp68pBzl8VxVeIjelY54kcjpJd9wmBvBS9TjMdWkXAeceM6V40w42zsR7IbWlZCT1OOyePr0wehPrJCa0DxilDhcrbM4iXDMElXNxl8NYymFew?key=6JbOGbzT9OJGF_xgkuNFdg" class="kg-image" alt="" loading="lazy" width="1600" height="704"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Impact of Cohere 
Rerank:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> Semi-structured retrieval accuracy based on Recall@5 on TMDB-5k-Movies, WikiSQL, nq-tables, and Cohere annotated datasets (higher is better). Source: </em></i><a href="https://cohere.com/blog/rerank-3"><u><i><em class="italic underline" style="white-space: pre-wrap;">Cohere blog</em></i></u></a><i><em class="italic" style="white-space: pre-wrap;">.</em></i></figcaption></figure><p><strong>Multi-language support:</strong> With support for over 100 languages, Cohere models are ideal for global enterprises. For example, Command R+ is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic.&nbsp; Additionally, pre-training data has been included for the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, and Persian.</p><p><strong>Flexible deployment:</strong> Cohere’s flexible deployment options allow businesses to bring Cohere's models to their own data, either on their servers or in private/commercial clouds, addressing critical security and compliance concerns. 
This flexibility, combined with Cohere's robust API and partnerships with major cloud providers like Amazon Web Services (through the Bedrock AI platform), ensures seamless integration and improved scalability for enterprise applications.</p><p><strong>Enterprise focus:</strong> Strategic partnerships with industry leaders like Fujitsu, Oracle, and McKinsey &amp; Company underscore Cohere's enterprise-centric approach.</p><p>Cohere models enable sophisticated enterprise AI applications such as:</p><ul><li>Investment research assistants</li><li>Support chatbots and Intelligent support copilots</li><li>Executive AI assistants</li><li>Document summarization tools</li><li>Knowledge and project staffing assistants</li><li>Regulatory compliance monitoring</li><li>Sentiment analysis for brand management</li><li>Research and development assistance</li></ul><p>Cohere's models offer a powerful, scalable, and easily implementable solution for enterprise search and RAG applications, balancing high accuracy with efficient performance.</p><h2 id="tutorial-build-search-and-rag-systems-on-postgresql-using-cohere-and-pgai">Tutorial: Build Search and RAG Systems on PostgreSQL Using Cohere and Pgai</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcDydTpU8JCBmeEgJZFg_NMrdqAUeWGAoM8bfvt_iSVgWHcaDZNYx-ePIOV3i1nLg2ht_NBAaMJd02A6gD4Ao2xtZ12AbzYzBa_jD2La5VaRJmdn6DlWEF7VzIBP_jI32CEdMhMo47uv41RXM1IzT-xyZhH?key=6JbOGbzT9OJGF_xgkuNFdg" class="kg-image" alt="" loading="lazy" width="1600" height="1048"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Process diagram showing how pgai combines full text, semantic search</em></i><span style="white-space: pre-wrap;">,</span><i><em class="italic" style="white-space: pre-wrap;"> and Cohere’s models to find the most relevant results to a user query using hybrid search.</em></i></figcaption></figure><p>Let’s look at an example of why Cohere's support in pgai is such a 
game-changer. We’ll perform a hybrid search over a corpus of news articles, combining PostgreSQL full-text search with pgvector’s semantic search, leveraging Cohere Embed and Rerank models in the process.</p><p>Here's what we'll cover:</p><ul><li><strong>Setting up your environment: </strong>Creating a Python virtual environment, installing necessary libraries, and setting up a PostgreSQL database with the pgai extension.</li><li><strong>Preparing your data: </strong>Creating a table to store news articles and loading sample data from the CNN/Daily Mail news dataset.</li><li><strong>Generating vector embeddings using pgai:</strong> Using Cohere's Embed model to create vector representations of text.</li><li><strong>Implementing full text and semantic search capabilities:</strong> Setting up PostgreSQL's full-text search for keyword matching and creating a vector search index for semantic search.</li><li><strong>Performing searches:</strong> Executing keyword searches using PostgreSQL's full-text search and semantic searches on embeddings using pgvector <a href="https://www.tigerdata.com/learn/hnsw-vs-diskann" rel="noreferrer">HNSW</a>.</li><li><strong>Performing hybrid search: </strong>Combining keyword and semantic search results by using Cohere's Rerank model to improve result relevance.</li></ul><p>By the end of this tutorial, you'll have a powerful hybrid search system that provides highly relevant search results, leveraging both traditional keyword search and semantic search techniques.</p><h3 id="setting-up-your-environment">Setting up your environment</h3><p>To begin, we'll set up a Python virtual environment and install the necessary libraries. This ensures a clean, isolated environment for our project.</p><pre><code class="language-bash">mkdir cohere
cd cohere
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`

pip install datasets "psycopg[binary]" python-dotenv
</code></pre><p><br>This setup:</p><ul><li>Creates a new directory for our project</li><li>Sets up a Python virtual environment</li><li>Activates the virtual environment</li><li>Installs the required Python packages:<ul><li><code>datasets</code>: For loading our sample dataset</li><li><code>psycopg</code>: PostgreSQL adapter for Python</li><li><code>python-dotenv</code>: For managing environment variables</li></ul></li></ul><p>Next, create a <code>.env</code> file in your project directory to store your database connection string and Cohere API key:</p><pre><code class="language-bash">DB_URL=postgres://username:password@localhost:5432/your_database
COHERE_API_KEY=your_cohere_api_key_here
</code></pre><p>Replace the placeholders with your actual database credentials and Cohere API key.</p><h3 id="setting-up-your-postgresql-database">Setting up your PostgreSQL database</h3><p>Now, let's set up our PostgreSQL database with the necessary extensions and table structure. We’ll create a table to hold our news articles. The <code>embedding</code> column will hold our embeddings which we’ll create using the Cohere Embed model. The <code>tsv</code> column will hold our full-text-search vectors.</p><p>Connect to your PostgreSQL database and run the following SQL commands:</p><pre><code class="language-postgresql">-- create the pgai extension
create extension if not exists ai cascade;

-- create a table for the news articles
-- the embedding column will store embeddings from cohere
-- the tsv column will store the full-text-search vector
create table cnn_daily_mail
( id bigint not null primary key generated by default as identity
, highlights text
, article text
, embedding vector(1024)
, tsv tsvector generated always as (to_tsvector('english', article)) stored
);

-- index the full-text-search vector
create index on cnn_daily_mail using gin (tsv);
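
-- illustrative addition (not part of the setup itself): once data is loaded,
-- the gin index supports keyword searches such as:
--   select id, highlights from cnn_daily_mail
--   where tsv @@ plainto_tsquery('english', 'climate change')
--   limit 5;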
</code></pre><p>This SQL script:</p><ul><li>Creates the <code>pgai</code> extension</li><li>Sets up a table to store our news articles, including columns for the article text, embeddings, and a full-text <a href="https://www.tigerdata.com/learn/understanding-vector-search" rel="noreferrer">search vector</a></li><li>Creates an index on the full-text search vector for efficient keyword searches</li></ul><h3 id="load-the-dataset">Load the dataset</h3><p>Now that our database is set up, we'll load sample data from the CNN/Daily Mail dataset. We'll use Python to download the dataset and efficiently insert it into our PostgreSQL table using the <code>copy</code> command with the <code>binary</code> format.</p><p>Create a new file called <code>load_data.py</code> in your project directory and add the following code:</p><pre><code class="language-python">import os
from dotenv import load_dotenv
from datasets import load_dataset
import psycopg


# Load environment variables
load_dotenv()
DB_URL = os.environ["DB_URL"]
COHERE_API_KEY = os.environ["COHERE_API_KEY"]  # loaded to fail fast if missing; this script itself never calls Cohere

# Load and prepare the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")
test = dataset["test"]
test = test.shuffle(seed=42).select(range(0,1000))

# Insert data into PostgreSQL
with psycopg.connect(DB_URL) as con:
    with con.cursor(binary=True) as cur:
        with cur.copy("copy public.cnn_daily_mail (highlights, article) from stdin (format binary)") as cpy:
            cpy.set_types(['text', 'text'])
            for row in test:
                cpy.write_row((row["highlights"], row["article"]))

print("Data loading complete!")
</code></pre><p>This script:</p><ol><li>Loads environment variables from the .env file</li><li>Downloads the CNN/Daily Mail dataset</li><li>Selects a random subset of 1,000 articles from the test set</li><li>Efficiently inserts the data into our PostgreSQL table using the binary COPY command</li></ol><p>To run the script and load the data:</p><pre><code class="language-bash">python load_data.py</code></pre><p>This approach is much faster than inserting rows one by one, especially for larger datasets. After running the script, you should have 1000 news articles loaded into your cnn_daily_mail table, ready for the next steps in our tutorial.</p><p><strong>Note</strong>: Before running this script, make sure your .env file is properly set up with the correct database URL.</p><h3 id="create-vector-embeddings-with-cohere-embed-and-pgai"><br>Create vector embeddings with Cohere Embed and pgai</h3><p>In this step, we'll use Cohere's Embed model to generate <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a> for our news articles. These embeddings will enable semantic search capabilities.</p><p>First, set your Cohere API key in your PostgreSQL session:</p><pre><code class="language-postgresql">select set_config('ai.cohere_api_key', '&lt;YOUR-API-KEY&gt;', false) is not null as set_cohere_api_key</code></pre><p>The pgai extension will use this key by default for the duration of your database session.</p><p>We'll explore two methods to generate embeddings: a simple approach and a more production-ready solution. Choose method 1 if you’re just experimenting and method 2 if you want to see how you’d run this in production.<br></p><p><strong>Method 1: Easy mode (for small datasets)</strong></p><p>This is the "easy-mode" approach to generating embeddings for each news article.&nbsp;</p><pre><code class="language-postgresql">-- this is the easy way to create the embeddings
-- but it's one LONG-running update statement
update cnn_daily_mail set embedding = ai.cohere_embed('embed-english-v3.0', article, input_type=&gt;'search_document');
</code></pre><p>This method is straightforward, but each call to Cohere is relatively slow, so this single UPDATE statement can run for a long time on larger datasets. And because it is a single statement, a failure partway through rolls back all of the embeddings generated so far.</p><p><br><strong>Method 2: Production mode (recommended for larger datasets)</strong></p><p>This method processes rows one at a time, allowing for better concurrency and error handling. We select and lock only a single row at a time, and we commit our work as we go along, so completed embeddings are preserved even if a later row fails. (Note that COMMIT inside a DO block requires PostgreSQL 11 or later and cannot be used inside an explicit transaction block.)</p><pre><code class="language-postgresql">-- this is a more production-appropriate way to generate the embeddings
-- we only lock one row at a time and we commit each row immediately
do $$
declare
    _id bigint;
    _article text;
    _embedding vector(1024);
begin
    loop
        select id, article into _id, _article
        from cnn_daily_mail
        where embedding is null
        for update skip locked
        limit 1;

        if not found then
            exit;
        end if;

        _embedding = ai.cohere_embed('embed-english-v3.0', _article, input_type=&gt;'search_document');
        update cnn_daily_mail set embedding = _embedding where id = _id;
        commit;
    end loop;
end;
$$;
</code></pre><p>Note that both methods above use the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed"><u>cohere_embed() function</u></a> provided by the pgai extension to generate embeddings for each article in the cnn_daily_mail table using the Cohere <a href="https://docs.cohere.com/docs/cohere-embed" rel="noreferrer">Embed v3 embedding model</a>.</p><p>After running one of the above methods, verify that all rows have embeddings:</p><pre><code class="language-postgresql">-- this should return 0
select count(*) from cnn_daily_mail where embedding is null;
</code></pre><p><strong>Create a vector search index</strong></p><p>To optimize vector similarity searches, create an index on the embedding column:</p><pre><code class="language-postgresql">-- index the embeddings
create index on cnn_daily_mail using hnsw (embedding vector_cosine_ops);
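
-- optional (sketch): HNSW parameters you can tune; the values shown are
-- pgvector's defaults, and higher values improve recall at the cost of speed
-- create index on cnn_daily_mail using hnsw (embedding vector_cosine_ops) with (m = 16, ef_construction = 64);
-- set hnsw.ef_search = 40;  -- search-time candidate list size, set per session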
</code></pre><p>This index uses the pgvector <a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">HNSW</a> (<a href="https://www.tigerdata.com/blog/vector-database-basics-hnsw" rel="noreferrer">hierarchical navigable small world</a>) algorithm, which is efficient for nearest-neighbor searches in high-dimensional spaces.</p><p>With these steps completed, your database is now set up for both keyword-based and semantic searches. In the next section, we'll explore how to perform these searches using <a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> and Cohere's models.</p><h2 id="perform-a-keyword-search-in-postgresql"><br>Perform a Keyword Search in PostgreSQL</h2><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">⚠️</div><div class="kg-callout-text">Note to reader: Unfortunately, the news dataset is quite depressing, so please forgive the negativity in the keywords and prompts below. All terms are chosen purely for illustrative and educational purposes.</div></div><p>PostgreSQL's full-text search capabilities allow us to perform complex keyword searches efficiently. We'll use the <code>tsv</code> column we created earlier, which contains the tsvector representation of each article.</p><pre><code class="language-postgresql">-- a full-text-search
-- must include "death" OR "kill"
-- must include "police" AND "car" AND "dog"
select article
from cnn_daily_mail
where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
;
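
-- variant (sketch): rank the matches by relevance with ts_rank instead of
-- returning them in arbitrary order (assumes the tsv column created earlier)
-- select article, ts_rank(tsv, to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')) as rank
-- from cnn_daily_mail
-- where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
-- order by rank desc
-- limit 10;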
</code></pre><p>Let's break down this query:</p><ul><li><code>@@</code> is the text search match operator in PostgreSQL.</li><li><code>to_tsquery('english', '...')</code> converts our search terms into a tsquery type.</li><li>The search logic is: (death OR kill) AND police AND car AND dog.</li></ul><p>This query will return articles that contain:</p><ol><li>Either "death" or "kill" (or their variations)</li><li>AND "police" (or variations)</li><li>AND "car" (or variations)</li><li>AND "dog" (or variations)</li></ol><p>This method of searching is fast and efficient, especially for exact keyword matches inside the article text. However, it doesn't capture semantic meaning or handle synonyms well. In the next section, we'll explore how to perform semantic searches using the embeddings we created earlier.</p><h2 id="perform-a-semantic-search-in-postgresql-with-pgvector-pgai-and-cohere-embed">Perform a Semantic Search in PostgreSQL With pgvector, pgai, and Cohere Embed </h2><p>Semantic search allows us to find relevant articles based on the meaning of a query rather than just matching keywords. We'll use Cohere's Embed model to convert our search query into a <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a>, then find the most similar article embeddings using pgvector.</p><pre><code class="language-postgresql">-- get the 15 most relevant stories to our question
with q as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as q
)
select article
from cnn_daily_mail
order by embedding &lt;=&gt; (select q from q limit 1)
limit 15
;
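
-- variant (sketch): also return the cosine distance so you can inspect how
-- relevant each match is (a smaller distance means more similar)
-- with q as (select ai.cohere_embed('embed-english-v3.0', 'Show me stories about police reports of deadly happenings involving cars and dogs.', input_type=&gt;'search_query') as q)
-- select article, embedding &lt;=&gt; (select q from q) as cosine_distance
-- from cnn_daily_mail
-- order by cosine_distance
-- limit 15;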
</code></pre><p>Let's break this query down:</p><ol><li>We use a <a href="https://www.timescale.com/learn/how-to-use-common-table-expression-sql" rel="noreferrer">CTE (Common Table Expression) </a>named <code>q</code> to generate the embedding for our search query using Cohere's model (also named <code>q</code>).</li><li><code>ai.cohere_embed()</code> is a function provided by pgai that interfaces with Cohere's API to create embeddings. See more in the <a href="https://github.com/timescale/pgai/blob/main/docs/cohere.md#cohere_embed" rel="noreferrer">pgai docs</a>.</li><li>We specify <code>input_type =&gt; 'search_query'</code> to optimize the embedding for search queries.</li><li>In the main query, we order the results by the cosine distance (<code>&lt;=&gt;</code>) between each article's embedding and our query embedding.</li><li>The <code>LIMIT 15</code> clause returns the top 15 most semantically similar articles.</li></ol><p>This semantic search can find relevant articles even if they don't contain the exact words used in the query. It understands the context and meaning of the search query and matches it with semantically similar content.</p><h2 id="performing-hybrid-search-using-pgai-and-cohere-rerank">Performing Hybrid Search Using Pgai and Cohere Rerank&nbsp;</h2><p>Hybrid search combines the strengths of both keyword and semantic searches and then uses Cohere's Rerank model to improve the relevance of the results further. This approach can provide more accurate and contextually relevant results than either method alone.</p><p>Here's how to perform a hybrid search with reranking:</p><pre><code class="language-postgresql">with full_text_search as
(
   select article
   from cnn_daily_mail
   where tsv @@ to_tsquery('english', '(death | kill) &amp; police &amp; car &amp; dog')
   limit 15
)
, vector_query as
(
   select ai.cohere_embed
   ('embed-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , input_type=&gt;'search_query'
   ) as query_embedding
)
, vector_search as
(
   select article
   from cnn_daily_mail
   order by embedding &lt;=&gt; (select query_embedding from vector_query limit 1)
   limit 15
)
, combined_results as
(
   select article
   from full_text_search
   union
   select article
   from vector_search
)
, reranked_results as
(
   select ai.cohere_rerank
   ( 'rerank-english-v3.0'
   , 'Show me stories about police reports of deadly happenings involving cars and dogs.'
   , (select jsonb_agg(combined_results.article) from combined_results)
   , top_n =&gt; 5
   , return_documents =&gt; true
   ) as response
)
select
  x.index
, x.document-&gt;&gt;'text' as article
, x.relevance_score
from reranked_results
cross join lateral jsonb_to_recordset(reranked_results.response-&gt;'results') x(document jsonb, index int, relevance_score float8)
order by relevance_score desc
;
</code></pre><p>Let's break down this query:</p><ol><li><code>full_text_search</code>: Performs a keyword search using PostgreSQL's full-text search capabilities.</li><li><code>vector_query</code>: Creates an embedding for our search query using Cohere's Embed model.</li><li><code>vector_search</code>: Performs a semantic search using the query embedding.</li><li><code>combined_results</code>: Combines the results from both keyword and semantic searches.</li><li><code>reranked_results</code>: Uses Cohere's Rerank model to reorder the combined results based on relevance to the query.</li><li>The final <code>SELECT</code> statement extracts and formats the reranked results.</li></ol><p>This hybrid approach offers several advantages:</p><ul><li>It captures both exact keyword matches and semantically similar content.</li><li>The reranking step helps to prioritize the most relevant results.</li><li>It can handle cases where either keyword or semantic search alone might miss relevant articles.</li></ul><p>To use this query effectively:</p><ol><li>Adjust the <code>to_tsquery</code> parameters to match your specific keyword search needs.</li><li>Modify the natural language query in both the <code>ai.cohere_embed</code> and <code>ai.cohere_rerank</code> functions to match your search intent.</li><li>Experiment with the <code>LIMIT</code> values and the <code>top_n</code> parameter to balance between recall and precision.</li></ol><p>Remember, while this approach can provide highly relevant results, it does involve multiple API calls to Cohere (for embedding and reranking), which may impact performance for large result sets or high-volume applications.&nbsp;</p><p>This hybrid search method demonstrates the power of combining traditional database search techniques with advanced AI models, all within your PostgreSQL database using pgai.</p><h2 id="get-started-with-cohere-and-pgai-today">Get Started With Cohere and Pgai Today</h2><p>The integration of Cohere’s Command, Embed, 
and Rerank models into pgai marks a significant milestone in our vision to help PostgreSQL evolve into an <a href="https://timescale.ghost.io/blog/making-postgresql-a-better-ai-database/"><u>AI database</u></a>. By bringing these state-of-the-art language models directly into your database environment, you can unlock new levels of efficiency, intelligence, and innovation in your projects.</p><p>Pgai is open source under the PostgreSQL License and is available for you to use in your AI projects today. You can find installation instructions on the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repository</u></a>. You can also access pgai on any database service on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Timescale’s cloud PostgreSQL platform</u></a>. If you’re new to Timescale, you can get started with a <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>free cloud PostgreSQL database here</u></a>.<br></p><p>Pgai is an effort to enrich the PostgreSQL ecosystem for AI. If you’d like to help, here’s how you can get involved:</p><ul><li>Got questions about using Cohere models in pgai? Join the <a href="https://discord.gg/Q8TajvAPRN?ref=timescale.com"><u>Postgres for AI Discord</u></a>, a community of developers building AI applications with PostgreSQL. Share what you’re working on, and help or get helped by a community of peers.</li><li>Share the news with your friends and colleagues: Share our posts about pgai on <a href="https://x.com/TimescaleDB?ref=timescale.com"><u>X/Twitter</u></a>, <a href="https://www.linkedin.com/company/timescaledb/?ref=timescale.com"><u>LinkedIn</u></a>, and Threads. We promise to RT back.</li><li>Submit issues and feature requests: We encourage you to submit issues and feature requests for functionality you’d like to see, bugs you find, and suggestions you think would improve pgai. 
Head over to the <a href="https://github.com/timescale/pgai/?ref=timescale.com"><u>pgai GitHub repo</u></a> to share your ideas.</li><li>Make a contribution: We welcome community contributions for pgai. Pgai is written in Python and PL/Python. Let us know which models you want to see supported, particularly for open-source embedding and generation models. See the <a href="https://github.com/timescale/pgai/blob/main/CONTRIBUTING.md?ref=timescale.com"><u>pgai GitHub for instructions to contribute</u></a>.</li><li>Offer the pgai extension on your PostgreSQL cloud: Pgai is an open-source project under the <a href="https://github.com/timescale/pgvectorscale/blob/main/LICENSE?ref=timescale.com"><u>PostgreSQL License</u></a>. We encourage you to offer pgai on your managed PostgreSQL database-as-a-service platform and can even help you spread the word. Get in touch via our <a href="https://www.timescale.com/contact?ref=timescale.com"><u>Contact Us form</u></a> and mention pgai to discuss further.</li></ul><p>We're excited to see what you'll build with PostgreSQL and pgai!</p><p><br></p><p><br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Nearest Neighbor Indexes: What Are IVFFlat Indexes in Pgvector and How Do They Work]]></title>
            <description><![CDATA[A primer on the pgvector’s Inverted File Flat (ivfflat) algorithm for approximate nearest neighbor search. ]]></description>
            <link>https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <dc:creator><![CDATA[Matvey Arye]]></dc:creator>
            <pubDate>Fri, 30 Jun 2023 13:03:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor_hero_4.png">
            </media:content>
<content:encoded><![CDATA[<p>The rising popularity of ChatGPT, OpenAI, and applications of Large Language Models (LLMs) has brought the concept of approximate nearest neighbor search (ANN) to the forefront and sparked a renewed interest in vector databases due to the use of embeddings. <a href="https://platform.openai.com/docs/guides/embeddings">Embeddings</a> are mathematical representations of phrases that capture the semantic meaning as a vector of numerical values. </p><p>What makes this representation fascinating—and useful—is that phrases with similar meanings will have similar vector representations, meaning the distance between their respective vectors will be small. We recently discussed one application of these embeddings, <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">retrieval-augmented generation</a>—augmenting base LLMs with knowledge they weren’t trained on—but there are numerous other applications as well.</p><h3 id="semantic-similarity-search">Semantic similarity search</h3><p>One common application of embeddings is precisely <a href="https://en.wikipedia.org/wiki/Semantic_search">semantic similarity search</a>. The basic concept behind this approach is that if I have a knowledge library consisting of various phrases and I receive a question from a user, I can locate the most relevant information in my library by finding the data that is most similar to the user's query. <br></p><p>This is in contrast to lexical or full-text search, which matches only the literal words of the query.
The remarkable aspect of this technique is that, since the embeddings represent the semantics of the phrase rather than its specific wording, I can find pertinent information even if it is expressed using completely different words!<br></p><h3 id="the-challenge-of-speed-at-scale">The challenge of speed at scale</h3><p>Semantic similarity search involves calculating an embedding for the user's question and then searching through my library to find the K most relevant items related to that question—these are the K items whose embeddings are closest to that of the question. However, when dealing with a large library, it becomes crucial to perform this search efficiently and swiftly. In the realm of vector databases, this problem is referred to as "Finding the k nearest neighbors" (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">KNN</a>).<br></p><p>This post discusses a method to enhance the speed of this search when utilizing PostgreSQL and <a href="https://www.timescale.com/ai">pgvector</a> for storing <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embeddings</a>: the <a href="https://github.com/pgvector/pgvector#indexing">Inverted File Flat (IVFFlat)</a> algorithm for approximate nearest neighbor search. We’ll cover why IVFFlat is useful, how it works, and best practices for using it in pgvector for fast similarity search over embeddings vectors. </p><p>Let’s go!</p><p><strong>P.S. 
</strong>If you’re looking for the fastest vector search index on PostgreSQL, <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer"><u>check out pgvectorscale</u></a>.</p><h2 id="what-are-ivfflat-indexes">What Are IVFFlat Indexes?</h2><p>IVFFlat indexes, short for Inverted File with Flat Compression, are a type of vector index used in PostgreSQL's <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">pgvector extension</a> to speed up similarity searches to find vectors that are close to a given query. This index type uses approximate nearest neighbor search (ANNS) to provide fast searches.&nbsp;&nbsp;</p><p>These indexes work by dividing the vectors into multiple lists, known as clusters. Each cluster represents a region of similar vectors, and an inverted index is built to map each region to its corresponding vectors. When a query comes in, the nearest clusters to the query are identified and only the vectors in those clusters are searched. Thus, this approach significantly reduces the scope of similarity searches by excluding all the vectors that are not in the clusters that are close to the query.</p><p></p><h2 id="why-use-the-ivfflat-index-in-pgvector">Why Use the IVFFlat Index in Pgvector</h2><p>Searching for the k-nearest neighbors is not a novel problem for PostgreSQL. <a href="https://docs.timescale.com/use-timescale/latest/extensions/postgis/">PostGIS</a>, a <a href="https://www.tigerdata.com/blog/top-8-postgresql-extensions" rel="noreferrer">PostgreSQL extension</a> for handling location data, stores its data points as two-dimensional vectors (longitude and latitude). Locating nearby locations is a crucial query in that domain. <br></p><p>PostGIS tackles this challenge by employing an index known as an R-Tree, which yields precise results for k-nearest neighbor queries. 
Similar techniques, such as KD-Trees and Ball Trees, are also employed for this type of search in other databases.<br></p><h3 id="the-curse-of-dimensionality">"The curse of dimensionality"</h3><p>However, there's a catch. These approaches cease to be effective when dealing with data larger than approximately 10 dimensions due to the "curse of dimensionality." Cue the ominous music! Essentially, as you add more dimensions, the available space increases exponentially, resulting in exponentially sparser data. This reduced density renders existing indexing techniques, like the aforementioned R-Tree, KD-Trees, and Ball Trees, which rely on partitioning the space, ineffective. (To learn more, I suggest these two videos: <a href="https://www.youtube.com/watch?v=BbYV8UfMJSA">1</a>, <a href="https://www.youtube.com/watch?v=E1_WCdUAtyE">2</a>). <br></p><p>Given that embeddings often consist of more than a thousand dimensions—OpenAI’s are 1,536—new techniques had to be developed. There are no known exact algorithms for efficiently searching in such high-dimensional spaces. Nevertheless, there are excellent <em>approximate</em> algorithms that fall into the category of approximate nearest neighbor algorithms. 
Numerous such algorithms exist, but in this article, we will delve into the Inverted File Flat or IVFFlat algorithm, which is provided by pgvector.</p><h2 id="how-the-ivfflat-index-works-in-pgvector">How the IVFFlat Index Works in pgvector</h2><p></p><h3 id="how-ivfflat-divides-the-space">How IVFFlat divides the space</h3><p>To gain an intuitive understanding of how IVFFlat works, let's consider a set of vectors represented in a two-dimensional space as the following points:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png" class="kg-image" alt="A set of vectors represented as points in two dimensions" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">A set of vectors represented as points in two dimensions</em></i></figcaption></figure><p>In the IVFFlat algorithm, the first step involves applying k-means clustering to the vectors to find cluster centroids. In the case of the given vectors, let's assume we perform k-means clustering and identify four clusters with the following centroids.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png" class="kg-image" alt="After k-means clustering, we identify four clusters indicated by the colored triangles." 
loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---2-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><i><em class="italic" style="white-space: pre-wrap;">After k-means clustering, we identify four clusters indicated by the colored triangles</em></i></figcaption></figure><p>After computing the centroids, the next step is to assign each vector to its nearest centroid. This is accomplished by calculating the distance between the vector and each centroid and selecting the centroid with the smallest distance as the closest one. This process conceptually maps each point in space to the closest centroid based on proximity.<br></p><p>By establishing this mapping, the space becomes divided into distinct regions surrounding each centroid (technically, this kind of division is called a <a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi Diagram</a>). Each region represents a cluster of vectors that exhibit similar characteristics or are close in semantic meaning. 
<br></p><p>This division enables efficient organization and retrieval of approximate nearest neighbors during subsequent search operations, as vectors within the same region are likely to be more similar to each other than those in different regions.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png" class="kg-image" alt=" The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---3-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The process of assigning each vector to its closest centroid conceptually divides the space into distinct regions that surround each centroid</span></figcaption></figure><h3 id="building-the-ivfflat-index-in-pgvector"><br>Building the IVFFlat index in pgvector</h3><p>IVFFlat proceeds to create an <a href="https://en.wikipedia.org/wiki/Inverted_index">inverted index</a> that maps each centroid to the set of vectors within the corresponding region. In pseudocode, the index can be represented as follows:</p><pre><code>inverted_index = {
  centroid_1: [vector_1, vector_2, ...],
  centroid_2: [vector_3, vector_4, ...],
  centroid_3: [vector_5, vector_6, ...],
  ...
}
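
# search sketch: find the centroid closest to the query vector, then
# scan only that centroid's list for the k nearest vectors
closest = argmin(distance(query, centroid) for centroid in inverted_index)
neighbors = k_smallest(inverted_index[closest], k, key=distance_to(query))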
</code></pre>
<p>Here, each centroid serves as a key in the inverted index, and the corresponding value is a list of vectors that belong to the region associated with that centroid. This index structure allows for efficient retrieval of vectors in a region when performing similarity searches.</p><h3 id="searching-the-ivfflat-index-in-pgvector">Searching the IVFFlat index in pgvector</h3><p>Let's imagine we have a query for the nearest neighbors to a vector represented by a question mark, as shown below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png" class="kg-image" alt="We want to find nearest neighbors to the vector represented by the question mark" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---4-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">We want to find nearest neighbors to the vector represented by the question mark</span></figcaption></figure><p>To find the approximate nearest neighbors using IVFFlat, the algorithm operates under the assumption that the nearest vectors will be located in the same region as the query vector. 
Based on this assumption, IVFFlat employs the following steps:</p><ol><li>Calculate the distance between the query vector (red question mark) and each centroid in the index.</li><li>Select the centroid with the smallest distance as the closest centroid to the query (the blue centroid in this example).</li><li>Retrieve the vectors associated with the region corresponding to the closest centroid from the inverted index.</li><li>Compute the distances between the query vector and each of the vectors in the retrieved set.</li><li>Select the K vectors with the smallest distances as the approximate nearest neighbors to the query.<br></li></ol><p>The use of the index in IVFFlat accelerates the search process by restricting the search to the region associated with the closest centroid. This results in a significant reduction in the number of vectors that need to be examined during the search. Specifically, if we have C clusters (centroids), on average, we can reduce the number of vectors to search by a factor of 1/C.</p><h3 id="searching-at-the-edge">Searching at the edge</h3><p>The assumption that the nearest vectors will be found in the same region as the query vector can introduce recall errors in IVFFlat. 
Consider the following query:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png" class="kg-image" alt=" ivfflat can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---5-1.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">IVFFlat</span><i><em class="italic" style="white-space: pre-wrap;"> can sometimes make errors when searching for nearest neighbors to a point at the edge of two regions of the vector space</em></i></figcaption></figure><p>From visual inspection, it becomes apparent that one of the light-blue vectors is closer to the query vector than any of the dark-blue vectors, despite the query vector falling within the dark-blue region. This illustrates a potential error in assuming that the nearest vectors will always be found within the same region as the query vector.<br></p><p>To mitigate this type of error, one approach is to search not only the region of the closest centroid but also the regions of the next closest R centroids. This approach expands the search scope and improves the chances of finding the true nearest neighbors. 
<br></p><p>In pgvector, this functionality is implemented through the <code>probes</code> parameter, which specifies the number of centroids to consider during the search, as described below.</p><h2 id="parameters-for-pgvector%E2%80%99s-ivfflat-implementation"><br>Parameters for Pgvector’s IVFFlat Implementation</h2><p>In the implementation of IVFFlat in pgvector, two key parameters are exposed: <code>lists</code> and <code>probes</code>.</p><h3 id="lists-parameter-in-pgvector">Lists parameter in pgvector</h3><p>The <code>lists</code> parameter determines the number of clusters created during index building (it’s called <code>lists</code> because each centroid has a list of vectors in its region). Increasing this parameter reduces the number of vectors in each list and results in smaller regions.<br></p><p>It offers the following trade-offs to consider:</p><ul><li>A higher <code>lists</code> value speeds up queries by reducing the search space during query time.</li><li>However, it also decreases the region size, which can lead to more recall errors by excluding some points.</li><li>Additionally, more distance comparisons are required to find the closest centroid during step one of the query process.<br></li></ul><p>Here are some recommendations for setting the <code>lists</code> parameter:</p><ul><li>For datasets with fewer than one million rows, use <code>lists = rows / 1000</code>.</li><li>For datasets with more than one million rows, use <code>lists = sqrt(rows)</code>.</li><li>It is generally advisable to have at least 10 clusters.</li></ul><h3 id="probes-parameter-in-pgvector">Probes parameter in pgvector</h3><p>The <code>probes</code> parameter determines the number of regions to consider at query time. By default, only the region corresponding to the closest centroid is searched. By increasing <code>probes</code>, more regions can be searched to improve recall at the cost of query speed. 
<br></p><p>The recommended value for the <code>probes</code> parameter is <code>probes = sqrt(lists)</code>.</p><h2 id="using-ivfflat-in-pgvector">Using IVFFlat in Pgvector</h2><h3 id="creating-an-index">Creating an index</h3><p>When creating an index, it is advisable to have existing data in the table already, since k-means uses that data to derive the cluster centroids.<br></p><p>The index in pgvector offers three different methods to calculate the distance between vectors: L2 (Euclidean), negative inner product, and cosine. It is essential to select the same method for both index creation and query operations. The following table shows the query operators and their corresponding index methods:</p>
<!--kg-card-begin: html-->
<table><thead><tr><th>Distance type</th><th>Query operator</th><th>Index method</th></tr></thead><tbody><tr><td>L2 / Euclidean</td><td><code>&lt;-&gt;</code></td><td><code>vector_l2_ops</code></td></tr><tr><td>Negative inner product</td><td><code>&lt;#&gt;</code></td><td><code>vector_ip_ops</code></td></tr><tr><td>Cosine</td><td><code>&lt;=&gt;</code></td><td><code>vector_cosine_ops</code></td></tr></tbody></table>
<!--kg-card-end: html-->
<p><strong>Note</strong>: OpenAI <a href="https://platform.openai.com/docs/guides/embeddings/limitations-risks">recommends</a> cosine distance for its embeddings.</p><p>To create an index in pgvector using IVFFlat, use a statement of the following form:</p><pre><code class="language-SQL">CREATE INDEX ON &lt;table name&gt; USING ivfflat (&lt;column name&gt; &lt;index method&gt;) WITH (lists = &lt;lists parameter&gt;);
</code></pre>
<p>Replace <code>&lt;table name&gt;</code> with the name of your table and <code>&lt;column name&gt;</code> with the name of the column that contains the vector type.</p><p>For example, if our table is named <code>embeddings</code> and our embedding vectors are in a column named <code>embedding</code>, we can create an IVFFlat index as follows:</p><pre><code class="language-SQL">CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
</code></pre>
<p>Here’s a simple Python function that you can use to create an IVFFlat index with the <code>lists</code> value chosen according to the recommendations above (the query-time <code>probes</code> setting is covered below):</p><pre><code class="language-Python">import math

def create_ivfflat_index(conn, table_name, column_name, query_operator="&lt;=&gt;"):
    # Map the query operator to its matching index method
    if query_operator == "&lt;-&gt;":
        index_method = "vector_l2_ops"
    elif query_operator == "&lt;#&gt;":
        index_method = "vector_ip_ops"
    elif query_operator == "&lt;=&gt;":
        index_method = "vector_cosine_ops"
    else:
        raise ValueError(f"unrecognized operator {query_operator}")

    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) AS cnt FROM {table_name};")
        num_records = cur.fetchone()[0]

        # lists = rows / 1000 (minimum 10) up to one million rows,
        # lists = sqrt(rows) beyond that
        if num_records &gt; 1000000:
            num_lists = int(math.sqrt(num_records))
        else:
            num_lists = max(num_records // 1000, 10)

        cur.execute(f'CREATE INDEX ON {table_name} USING ivfflat ({column_name} {index_method}) WITH (lists = {num_lists});')
        conn.commit()
</code></pre>
<h3 id="querying">Querying</h3><p>The index can be used whenever a query has an <code>ORDER BY</code> of the form <code>column &lt;query operator&gt; &lt;some pseudo-constant vector&gt;</code> along with a <code>LIMIT k</code>.<br></p><p><strong>Some examples</strong><br><br>Get the closest two vectors to a constant vector:</p><pre><code class="language-SQL">SELECT * FROM my_table ORDER BY embedding_column &lt;=&gt; '[1,2]' LIMIT 2;
</code></pre>
<p>This is a common usage pattern in retrieval-augmented generation (RAG) using LLMs, where we find the embedding vectors that are closest in semantic meaning to the user’s query. In that case, the constant vector would be the embedding vector representing the user’s query.</p><p>You can see an example of this in our guide to <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/">creating, storing, and querying OpenAI embeddings with pgvector</a>, where we use this Python function to find the three most similar documents to a given user query from our database:</p><pre><code class="language-Python">import numpy as np
from pgvector.psycopg2 import register_vector

# Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension so vector values round-trip correctly
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs
</code></pre>
<p>Get the closest vectors to the vector stored in a given row (excluding that row itself):</p><pre><code class="language-SQL">SELECT * FROM my_table WHERE id != 1 ORDER BY embedding_column &lt;=&gt; (SELECT embedding_column FROM my_table WHERE id = 1) LIMIT 2;
</code></pre>
<p><strong>Tip:</strong> PostgreSQL's ability to use an index does not guarantee that it will actually be used! The cost-based planner evaluates candidate plans and may decide that a sequential scan or a different index is cheaper for a given query. Use the <code>EXPLAIN</code> command to see the chosen execution plan. To test whether the index can be used at all, you can adjust planner cost parameters until you get the desired plan; on small datasets, <code>SET enable_seqscan = off</code> is especially handy because it discourages sequential scans.<br></p><p>To adjust the <code>probes</code> parameter, set the <code>ivfflat.probes</code> variable. For instance, to set it to 5, execute the following statement before running the query:</p><pre><code class="language-sql">SET ivfflat.probes = 5;
</code></pre>
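<p>Putting the earlier tuning heuristics together, here is a small sketch (the rules are this post's recommendations, not pgvector defaults, and the function name is ours) that derives both <code>lists</code> and <code>probes</code> from the row count:</p>

```python
import math

# Sketch: derive IVFFlat tuning values from the table's row count, using
# the heuristics recommended in this post (not pgvector defaults):
#   lists  = rows / 1000 (minimum 10) up to one million rows,
#   lists  = sqrt(rows) beyond that,
#   probes = sqrt(lists).
def recommended_ivfflat_params(num_rows: int) -> tuple[int, int]:
    if num_rows > 1_000_000:
        lists = int(math.sqrt(num_rows))
    else:
        lists = max(num_rows // 1000, 10)
    probes = max(int(math.sqrt(lists)), 1)
    return lists, probes

print(recommended_ivfflat_params(500_000))   # (500, 22)
```

<p>The returned <code>lists</code> value goes into <code>CREATE INDEX ... WITH (lists = ...)</code>, and <code>probes</code> into <code>SET ivfflat.probes = ...</code> before querying.</p>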
<h3 id="dealing-with-data-changes">Dealing with data changes</h3><p>As your data evolves with inserts, updates, and deletes, the IVFFlat index will be updated accordingly. New vectors will be added to the index, while no longer-used vectors will be removed. </p><p><strong>However, the clustering centroids will not be updated</strong>. Over time, this can result in a situation where the initial clustering, established during index creation, no longer accurately represents the data. This can be visualized as follows:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png" class="kg-image" alt="As data gets inserted or deleted from the index, if the index is not rebuilt, the ivfflat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well" loading="lazy" width="1640" height="1040" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2023/06/nearest-neighbor-pgvector-diagram---6.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2023/06/nearest-neighbor-pgvector-diagram---6.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2023/06/nearest-neighbor-pgvector-diagram---6.png 1600w, https://timescale.ghost.io/blog/content/images/2023/06/nearest-neighbor-pgvector-diagram---6.png 1640w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">As data gets inserted or deleted from the index, if the index is not rebuilt, the IVFFlat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well</span></figcaption></figure><p>To address this issue, the only solution is to rebuild the index.<br></p><p>Here are two important takeaways from this issue:</p><ul><li>Build the index once you have all the representative data you want to reference in it. 
This is unlike most indexes, which can be built on an empty table.</li><li>It is advisable to rebuild the index periodically.<br></li></ul><p>When rebuilding the index, it is highly recommended to use the <code>CONCURRENTLY</code> option to avoid interfering with ongoing operations.<br></p><p>Thus, to rebuild the index, run the following (for example, from a cron job):</p><pre><code class="language-SQL">REINDEX INDEX CONCURRENTLY &lt;index name&gt;;
</code></pre>
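<p>As a concrete sketch of driving this from Python: the index name <code>embeddings_embedding_idx</code> below is an assumption (look up the real name in <code>pg_indexes</code>), and note that <code>REINDEX ... CONCURRENTLY</code> cannot run inside a transaction block, so autocommit must be enabled:</p>

```python
# Sketch of a periodic IVFFlat rebuild. The helper only builds the SQL
# statement; running it requires a live psycopg2 connection (commented
# out below) with autocommit on, because REINDEX CONCURRENTLY cannot
# run inside a transaction block.
def reindex_statement(index_name: str) -> str:
    # Quote the identifier; assumes a simple, trusted index name.
    return f'REINDEX INDEX CONCURRENTLY "{index_name}";'

# Usage with a live connection:
# conn.autocommit = True
# with conn.cursor() as cur:
#     cur.execute(reindex_statement("embeddings_embedding_idx"))
```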
<h2 id="summing-it-up">Summing It Up</h2><p>The IVFFlat algorithm in pgvector provides an efficient solution for approximate nearest neighbor search over high-dimensional data like embeddings. It works by clustering similar vectors into regions and building an inverted index that maps each region to its vectors. This allows queries to focus on a subset of the data, enabling fast search. By tuning the <code>lists</code> and <code>probes</code> parameters, IVFFlat can balance speed and accuracy for a dataset.</p><p>Overall, IVFFlat gives PostgreSQL the ability to perform fast semantic similarity search over complex data. With simple queries, applications can find the nearest neighbors to a query vector among millions of high-dimensional vectors. For natural language processing, information retrieval, and more, IVFFlat is a compelling solution. By understanding how IVFFlat divides the vector space into regions and builds its inverted index, you can optimize its performance for your needs and build powerful applications on top of it.</p><p>✨<strong>Resources for further learning:</strong> Now that you know more about the IVFFlat index in pgvector, here are some resources to further your learning journey:&nbsp;</p><ul><li>Learn about other PostgreSQL indexes for vector search, like <a href="https://timescale.ghost.io/blog/vector-database-basics-hnsw/" rel="noreferrer">HNSW</a>.</li><li>Learn how we made <a href="https://timescale.ghost.io/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/" rel="noreferrer">PostgreSQL as fast as Pinecone for vector data</a>.</li><li>Follow our tutorial on <a href="https://timescale.ghost.io/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/" rel="noreferrer">creating, storing, and querying OpenAI embeddings using PostgreSQL as a vector database</a>.</li><li><a href="https://timescale.ghost.io/blog/how-to-build-llm-applications-with-pgvector-vector-store-in-langchain/" rel="noreferrer">Learn how to use pgvector as a vector store in LangChain</a>.</li><li><a href="https://timescale.ghost.io/blog/refining-vector-search-queries-with-time-filters-in-pgvector-a-tutorial/" rel="noreferrer">See how you can refine vector search queries with time filters in pgvector using a single SQL query</a>.</li></ul><p>And if you’re looking for a production-ready PostgreSQL database for your AI application’s vector, relational, and time-series data, <a href="https://www.timescale.com/ai" rel="noreferrer"><u>try Timescale Cloud</u></a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL as a Vector Database: A Pgvector Tutorial]]></title>
            <description><![CDATA[Vector databases add organizational intelligence to AI. Learn how to use PostgreSQL as a vector database for retrieval-augmented generation with pgvector.]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-as-a-vector-database-using-pgvector</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[pgvector]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[OpenAI]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Wed, 21 Jun 2023 18:22:10 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2023/06/Postgres-vector-database-and-OpenAI-embeddings-blog--1-.png">
            </media:content>
            <content:encoded><![CDATA[<p>Vector databases enable efficient storage and searching of vector data. They are essential for developing and maintaining AI applications using large language models (LLMs).</p><p>With some help from the <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer"><u>pgvector extension</u></a>, you can leverage PostgreSQL as a vector database to store and query<a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"> <u>OpenAI embeddings</u></a>. OpenAI embeddings are a type of data representation (in the shape of vectors, i.e., lists of numbers) used to measure the similarity of text strings for OpenAI’s models.</p><p>In this article, we work through the example of creating a chatbot to answer questions about Tiger Data (creators of TimescaleDB). The chatbot will be trained on content from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/"><u>Tiger Data Developer Q&amp;A blog posts</u></a>. 
This example will illustrate the key concepts for creating, storing, and querying OpenAI embeddings with PostgreSQL and pgvector.</p><p>This example has three parts:</p><ul><li>Part 1: How to create embeddings from content using the<a href="https://platform.openai.com/docs/api-reference?ref=timescale.com"> <u>OpenAI API</u></a>.</li><li>Part 2: How to use PostgreSQL as a vector database and store OpenAI embedding vectors using pgvector.</li><li>Part 3: How to use embeddings retrieved from a vector database to augment LLM generation.</li></ul><p>One could think of this as a “hello world” tutorial for building a chatbot that can reference a company knowledge base or developer docs.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">✨</div><div class="kg-callout-text"><b><strong style="white-space: pre-wrap;">Jupyter Notebook and Code:</strong></b> You can find all the code used in this tutorial in a Jupyter Notebook, as well as sample content and embeddings on the Tiger Data GitHub: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">timescale/vector-cookbook</a>. We recommend cloning the repo and following along by executing the code cells as you read through the tutorial.</div></div><h2 id="the-big-picture-openai-embeddings">The Big Picture: OpenAI Embeddings</h2><p>Foundational models of AI (e.g., GPT-3 or GPT-4) may be missing some information needed to provide accurate answers to certain specific questions. That’s because relevant information was not in the dataset used to train the model. (For example, the information is stored in private documents or only became available recently.) 
This lack of data may make these models unsuitable for use as a chatbot in specific information banks.</p><p><a href="https://www.promptingguide.ai/techniques/rag?ref=timescale.com"><u>Retrieval-augmented generation</u></a> (RAG) gives a simple solution; it provides additional context to the foundational model in the prompt. </p><p>This technique is powerful—it allows you to “teach” foundational models about things only you know about and use that to create a ChatGPT++ experience for your users!</p><p>But what context should you provide to the model? If you have a library of information, how can you determine what’s relevant to a given question? That is what <a href="https://www.timescale.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">embeddings</a> are for. <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings?ref=timescale.com"><u>OpenAI embeddings</u></a> are a mathematical representation of the semantic meaning of a piece of text that allows for <em>similarity search</em>.</p><p>With this representation, when you get a user question and calculate its embedding, you can use a similarity search against data embeddings in your library to find the most relevant information. But that requires having an embedding representation of your library.&nbsp;&nbsp;</p><h3 id="what-is-a-vector-database">What is a vector database?</h3><p>A <a href="https://www.timescale.com/blog/how-to-choose-a-vector-database"><u>vector database</u></a> is a database that can handle vector data. Vector databases are useful for:</p><ul><li><a href="https://www.tigerdata.com/learn/vector-search-vs-semantic-search" rel="noreferrer"><strong>Semantic search</strong></a><strong>:</strong> Vector databases facilitate semantic search, which considers the context or meaning of search terms rather than just exact matches. 
They are useful for recommendation systems, content discovery, and question-answering systems.</li><li><strong>Efficient similarity search:</strong> Vector databases are designed for efficient high-dimensional nearest neighbor search, a task where traditional relational databases struggle.</li><li><strong>Machine learning:</strong> Vector databases store and search embeddings created by machine-learning models. This feature aids in finding items semantically similar to a given item.</li><li><strong>Multimedia data handling:</strong> Vector databases also excel in working with multimedia data (images, audio, video) by converting them into high-dimensional vectors for efficient similarity search.</li><li><strong>NLP and data combination:</strong> In natural language processing (NLP), vector databases store high-dimensional vectors representing words, sentences, or documents. They also allow a combination of traditional SQL queries with similarity searches, accommodating both structured and unstructured data.</li></ul><p>We’ll use PostgreSQL with the <a href="https://github.com/pgvector/pgvector"><u>pgvector extension</u></a> installed as our vector database. Pgvector extends PostgreSQL to handle vector data types and vector similarity search, like <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search"><u>nearest neighbor search</u></a>, which we’ll use to find the k most related embeddings in our database for a given user prompt.</p><h2 id="using-pgvector-for-a-postgresql-vector-database">Using Pgvector for a PostgreSQL Vector Database</h2><p><a href="https://www.timescale.com/learn/using-pgvector-with-python"><u>Pgvector</u></a> is an open-source extension for PostgreSQL that enables storing and searching over machine learning-generated embeddings. It provides different capabilities that allow users to identify exact and approximate nearest neighbors. 
Pgvector is designed to work seamlessly with other PostgreSQL features, including indexing and querying.</p><p>Now we’re ready to start building our chatbot!</p><h3 id="why-use-pgvector-as-a-vector-database">Why use pgvector as a vector database?</h3><p>Here are five reasons <a href="https://www.tigerdata.com/blog/postgres-for-everything" rel="noreferrer">why <u>PostgreSQL</u></a> is a good choice for storing and handling vector data:</p><ul><li><strong>Integrated solution:</strong> By using PostgreSQL as a vector database, you keep your data in one place. This can simplify your architecture by reducing the need for multiple databases or additional services.</li><li><strong>Enterprise-level robustness and operations:</strong> With a 30-year pedigree, PostgreSQL provides world-class data integrity, operations, and robustness. This includes backups, streaming replication, role-based and row-level security, and ACID compliance.</li><li><strong>Full-featured SQL:</strong> PostgreSQL supports a rich set of SQL features, including joins, subqueries, window functions, and more. This allows for powerful and complex queries that can include both traditional relational data and vector data. It also integrates with a plethora of existing data science and data analysis tools.</li><li><strong>Scalability and performance:</strong> PostgreSQL is known for its robustness and ability to handle large datasets. Using it as a vector database allows you to leverage these characteristics for vector data as well.</li><li><strong>Open source:</strong> PostgreSQL is open source, which means it's free to download and use, and you can modify it to suit your needs. It also means that it benefits from the collective input of developers all over the world, which often results in high-quality, secure, and up-to-date software. PostgreSQL has a large and active community, so help is readily available. 
There are many resources, including documentation, tutorials, forums, and more, to help you troubleshoot and optimize your PostgreSQL database.</li></ul><h2 id="setting">Setting Up</h2><ul><li>Install Python.</li><li>Install and configure a Python virtual environment. We recommend <a href="https://github.com/pyenv/pyenv">Pyenv</a>.</li><li>Install the requirements for this notebook using the following command:</li></ul><pre><code class="language-Python">pip install -r requirements.txt
</code></pre>
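<p>For reference, the imports used in this tutorial imply a requirements file along these lines (a sketch inferred from the imports below; the exact, pinned contents live in the companion <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">vector-cookbook</a> repo):</p>

```text
openai
pandas
numpy
tiktoken
psycopg2-binary  # or psycopg2, if building from source
pgvector
python-dotenv
```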
<p>Import all the packages we will be using:</p><pre><code class="language-Python">import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
</code></pre>
<p>You’ll need to <a href="https://platform.openai.com/overview">sign up for an OpenAI developer account</a> and create an OpenAI API key. We recommend getting a paid account to avoid rate limiting, and setting a spending cap so that you avoid any surprises with bills.</p><p>Once you have an OpenAI API key, it’s a <a href="https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety">best practice</a> to store it as an environment variable and then have your Python program read it.</p><pre><code class="language-Python"># First, run: export OPENAI_API_KEY=sk-YOUR_OPENAI_API_KEY
# (or put the same line, without "export", in a local .env file)

# Get the OpenAI API key from the environment / local .env file
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']
</code></pre>
<h2 id="part-1-create-embeddings-for-your-postgresql-vector-database">Part 1: Create Embeddings for Your PostgreSQL Vector Database</h2><p><a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">Embeddings</a> measure how related text strings are. First, we'll create embeddings using the OpenAI API on some text we want the LLM to answer questions on.</p><p>In this example, we'll use content from the Tiger Data blog, specifically from the <a href="https://timescale.ghost.io/blog/tag/dev-q-a/">Developer Q&amp;A section</a>, which features posts by Tiger Data users talking about their real-world use cases.</p><p>You can replace this blog data with any text you want to embed, such as your own company blog, developer documentation, internal knowledge base, or any other information you’d like to have a “ChatGPT-like” experience over.</p><pre><code class="language-Python"># Load your CSV file into a pandas DataFrame
df = pd.read_csv('blog_posts_data.csv')
df.head()
</code></pre>
<p>The output looks like this:</p>
<!--kg-card-begin: html-->
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky">Title</th>
    <th class="tg-0pky">Content</th>
    <th class="tg-0lax">URL</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky">0</td>
    <td class="tg-0pky">How to Build a Weather Station With Elixir, Ne...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-to-build-a-...</td>
  </tr>
  <tr>
    <td class="tg-0pky">1</td>
    <td class="tg-0pky">CloudQuery on Using PostgreSQL for Cloud Asset...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/cloudquery-on-u...</td>
  </tr>
  <tr>
    <td class="tg-0pky">2</td>
    <td class="tg-0pky">How a Data Scientist Is Building a Time-Series...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-a-data-scie...</td>
  </tr>
  <tr>
    <td class="tg-0pky">3</td>
    <td class="tg-0pky">How Conserv Safeguards History: Building an En...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-conserv-saf...</td>
  </tr>
  <tr>
    <td class="tg-0pky">4</td>
    <td class="tg-0pky">How Messari Uses Data to Open the Cryptoeconom...</td>
    <td class="tg-0pky">This is an installment of our “Community Membe...</td>
    <td class="tg-0lax">https://www.timescale.com/blog/how-messari-use...</td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<h3 id="11-calculate-the-cost-of-embedding-data">1.1 Calculate the cost of embedding data</h3><p>It's usually a good idea to estimate how much creating embeddings for your selected content will cost. Below, we define a few helper functions that calculate a cost estimate before we create any embeddings, so there are no surprises.</p><p>OpenAI charges per token for the embeddings it creates. The total cost for the blog posts we want to embed will be less than $0.01, thanks to OpenAI’s small text embedding model, <a href="https://openai.com/index/new-embedding-models-and-api-updates/"><u>text-embedding-3-small</u></a>. This model offers not only stronger performance but also a 5x cost reduction compared to its predecessor, <a href="https://openai.com/blog/new-and-improved-embedding-model?ref=timescale.com"><u>text-embedding-ada-002</u></a>.</p><pre><code class="language-Python"># Helper functions for estimating embedding cost

# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name="cl100k_base") -&gt; int:
    """Return the number of tokens in a text string."""
    if not string:
        return 0
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Helper function: calculate length of essay
def get_essay_length(essay):
    word_list = essay.split()
    num_words = len(word_list)
    return num_words

# Helper function: calculate cost of embedding num_tokens
# Assumes we're using the text-embedding-3-small model ($0.02 per 1M tokens)
# See https://openai.com/pricing
def get_embedding_cost(num_tokens):
    return num_tokens / 1_000_000 * 0.02

# Helper function: calculate total cost of embedding all content in the dataframe
def get_total_embeddings_cost():
    total_tokens = 0
    for i in range(len(df.index)):
        text = df['content'][i]
        token_len = num_tokens_from_string(text)
        total_tokens = total_tokens + token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost

</code></pre>
<pre><code class="language-Python"># quick check on total token amount for price estimation
total_cost = get_total_embeddings_cost()
print("estimated price to embed this content = $" + str(total_cost))

</code></pre>
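<p>As a quick sanity check on the estimate, the arithmetic is simple. Assuming text-embedding-3-small's rate of $0.02 per million tokens (the figure used in the helper above), even half a million tokens of blog content comes to about a cent. The token count below is illustrative, not our actual total:</p>

```python
# Pricing assumption: text-embedding-3-small at $0.02 per 1M tokens
price_per_million_tokens = 0.02
total_tokens = 500_000  # illustrative token count, not our actual total

cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"estimated cost = ${cost:.4f}")  # estimated cost = $0.0100
```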
<h3 id="12-create-smaller-chunks-of-content">1.2 Create smaller chunks of content</h3><p>The OpenAI API has a maximum token <a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings"><u>limit</u></a> per embedding request: 8,191 tokens, to be specific. To stay under this limit, we'll break our text into smaller chunks. Generally, it's a best practice to “chunk” the documents you want to create embeddings for into pieces of a fixed token size.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png" class="kg-image" alt="A table with the performance eval of the OpenAI embedding models" loading="lazy" width="1748" height="546" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1600w, https://timescale.ghost.io/blog/content/images/2024/10/Making-PostgreSQL-a-Vector-Database-pgvector-tutorial_embedding-models.png 1748w" sizes="(min-width: 720px) 720px"></figure><p>The precise number of tokens to include in a chunk depends on your use case and your model’s context window—the number of input tokens it can handle in a prompt.</p><p>For our purposes, we'll aim for chunks of around 512 tokens each. Chunking text is a complex topic worthy of its own blog post; below, we'll illustrate a simple method that we found works well. If you want to read about other approaches, we recommend <a href="https://python.langchain.com/docs/how_to/#text-splitters"><u>this section</u></a> of the LangChain docs.</p><p><strong>Note:</strong> If you prefer to skip this step, you can use the provided file: <a href="https://github.com/timescale/vector-cookbook/tree/main/openai_pgvector_helloworld">blog_data_and_embeddings.csv</a>, which contains the data and embeddings that you'll generate in this step.</p><p>The code below creates a new list of chunked blog content while retaining the metadata associated with each chunk, such as the title and URL of the blog post the text came from.</p><pre><code class="language-Python">
# Create new list with small content chunks to not hit max token limits
# Note: the maximum number of tokens for a single request is 8191
# https://platform.openai.com/docs/guides/embeddings/embedding-models

# list for chunked content and embeddings
new_list = []
# Split up the text into token sizes of around 512 tokens
for i in range(len(df.index)):
    text = df['content'][i]
    token_len = num_tokens_from_string(text)
    if token_len &lt;= 512:
        new_list.append([df['title'][i], df['content'][i], df['url'][i], token_len])
    else:
        # add content to the new list in chunks
        start = 0
        ideal_token_size = 512
        # 1 token ~ 3/4 of a word
        ideal_size = int(ideal_token_size // (4/3))
        end = ideal_size
        # split the text into words on whitespace
        words = text.split()

        # drop any stray whitespace-only entries
        words = [x for x in words if x != ' ']

        total_words = len(words)

        # calculate the number of chunks (rounding up)
        chunks = total_words // ideal_size
        if total_words % ideal_size != 0:
            chunks += 1
        
        new_content = []
        for j in range(chunks):
            if end &gt; total_words:
                end = total_words
            new_content = words[start:end]
            new_content_string = ' '.join(new_content)
            new_content_token_len = num_tokens_from_string(new_content_string)
            if new_content_token_len &gt; 0:
                new_list.append([df['title'][i], new_content_string, df['url'][i], new_content_token_len])
            start += ideal_size
            end += ideal_size

</code></pre>
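<p>The word-count heuristic above is simple but approximate (some chunks land slightly over or under the 512-token target). As one alternative, here is a sketch (our own addition, not part of the original notebook) that splits on exact token boundaries by encoding the text first. In this notebook the <code>encoding</code> argument would be <code>tiktoken.get_encoding("cl100k_base")</code>, but any object with <code>encode</code>/<code>decode</code> methods works:</p>

```python
def chunk_by_tokens(text, encoding, chunk_size=512):
    """Split text into consecutive chunks of at most chunk_size tokens each."""
    tokens = encoding.encode(text)
    # Decode each fixed-size slice of tokens back into a text chunk
    return [
        encoding.decode(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]
```

<p>Token-exact splitting guarantees every chunk fits the embedding model's limit, at the cost of occasionally cutting mid-sentence.</p>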
<p>Now that our text is split into appropriately sized chunks, we can create embeddings for each one using the OpenAI API.</p><p>We’ll use this helper function to create an embedding for a piece of text:</p><pre><code class="language-Python">openai_client = openai.OpenAI()

# Helper function: get embeddings for a text
def get_embeddings(text):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input = text.replace("\n"," ")
    )
    return response.data[0].embedding</code></pre>
<p><br>And then create embeddings for each chunk of content:</p><pre><code class="language-Python"># Create embeddings for each piece of content
for i in range(len(new_list)):
    text = new_list[i][1]
    embedding = get_embeddings(text)
    new_list[i].append(embedding)

# Create a new dataframe from the list
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens', 'embeddings'])
df_new.head()

</code></pre>
<p>The new data frame should look like this:</p>
<!--kg-card-begin: html-->
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0pky"></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Title</span></th>
    <th class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Content</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">URL</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Tokens</span></th>
    <th class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Embeddings</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">0</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">501</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.021440856158733368, 0.02200360782444477, -0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">1</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">capture weather and environmental data. In all...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">512</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.016165969893336296, 0.011341351084411144, 0...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">2</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">How to Build a Weather Station With Elixir, Ne...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">command in their database migration:SELECT cre...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/how-to-build-a-...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">374</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.022517921403050423, -0.0019158280920237303,...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">3</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">This is an installment of our “Community Membe...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">519</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.009028822183609009, -0.005185891408473253, ...</span></td>
  </tr>
  <tr>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">4</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">CloudQuery on Using PostgreSQL for Cloud Asset...</span></td>
    <td class="tg-0pky"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">Architecture with CloudQuery SDK- Writing plug...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">https://www.timescale.com/blog/cloudquery-on-u...</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">511</span></td>
    <td class="tg-0lax"><span style="font-weight:400;font-style:normal;text-decoration:none;color:#000;background-color:transparent">[0.02050386555492878, 0.010169642977416515, 0....</span></td>
  </tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><br>As an optional but recommended step, you can save the blog content along with its embeddings to a CSV file, so that you don't have to recreate the embeddings if you want to reuse them in another project.</p><pre><code class="language-Python"># Save the dataframe with embeddings as a CSV file
df_new.to_csv('blog_data_and_embeddings.csv', index=False)
</code></pre>
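<p>One caveat for later: when you read that CSV back with pandas, the embeddings column comes back as strings rather than lists of floats, so parse it before reuse. A minimal, self-contained sketch of the round trip (using an in-memory buffer and toy data in place of the real file):</p>

```python
import ast
import io

import pandas as pd

# Tiny stand-in for df_new: one chunk with a two-dimensional "embedding"
df_demo = pd.DataFrame(
    [["Example title", "Example content", "https://example.com", 3, [0.1, 0.2]]],
    columns=["title", "content", "url", "tokens", "embeddings"],
)

# Round-trip through CSV; a real path like 'blog_data_and_embeddings.csv'
# behaves the same way as this in-memory buffer
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)

df_loaded = pd.read_csv(buf)
# The column comes back as the string "[0.1, 0.2]"; parse it back into floats
df_loaded["embeddings"] = df_loaded["embeddings"].apply(ast.literal_eval)
```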
<h2 id="pro-tip-automating-embedding-creation-with-pgai-vectorizer">Pro Tip: Automating Embedding Creation with pgai Vectorizer</h2><p>In the section above, we showed how to manually create and manage embeddings in your own data pipeline: chunking content, calling the OpenAI API, and storing the results. While this approach helps you understand the fundamentals, in production you may want to automate the process completely. Let’s look at how <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>pgai Vectorizer</u></a> can handle this entire pipeline for you.</p><p>Managing embeddings in production involves several challenges: keeping embeddings in sync with changing content, handling API failures, and chunking text optimally.</p><p><a href="https://www.tigerdata.com/blog/pgai-giving-postgresql-developers-ai-engineering-superpowers" rel="noreferrer">pgai</a> Vectorizer automates this entire process directly in PostgreSQL, much as PostgreSQL automatically <a href="https://docs.timescale.com/use-timescale/latest/schema-management/indexing?ref=timescale.com"><u>maintains indexes</u></a> for your tables.</p><h3 id="setting-up-pgai-vectorizer">Setting Up pgai Vectorizer</h3><p>The setup process differs depending on whether you’re using Tiger Cloud (formerly Timescale Cloud) or hosting PostgreSQL yourself.</p><p><strong>On Tiger Cloud</strong></p><pre><code class="language-sql">-- 1. Store your OpenAI API key securely in Timescale Cloud
-- 2. Navigate to Project Settings &gt; AI Model API Keys in the Timescale Console
-- 3. The key is stored securely and not in your database
-- 4. Create the extensions
CREATE EXTENSION IF NOT EXISTS ai;</code></pre><p><strong>For self-hosted PostgreSQL</strong></p><pre><code class="language-bash">export OPENAI_API_KEY="your-api-key-here"

# Start the vectorizer worker
vectorizer-worker --connection="postgres://user:password@host:port/dbname"</code></pre><h3 id="creating-your-first-vectorizer">Creating Your First Vectorizer</h3><p>Instead of manually creating embeddings using Python, you can define a <em>vectorizer</em> that automatically generates and maintains embeddings for your content:</p><pre><code class="language-sql">SELECT ai.create_vectorizer( 
   'blog_posts'::regclass,
    destination =&gt; 'blog_embeddings',
    embedding =&gt; ai.embedding_openai('text-embedding-3-small', 768),
    chunking =&gt; ai.chunking_recursive_character_text_splitter('content'),
    -- Pro tip: Add blog title as context to each chunk
    formatting =&gt; ai.formatting_python_template('$title: $chunk')
);</code></pre><p>This single SQL command:</p><ol><li>Automatically chunks your blog content</li><li>Creates embeddings for each chunk using OpenAI's API</li><li>Maintains embeddings as your content changes</li><li>Creates a view that joins your content with its embeddings</li></ol><h3 id="searching-with-vectorizer">Searching with Vectorizer</h3><p>You can then search your content the same way as before:</p><pre><code class="language-sql">SELECT 
   chunk,
   embedding &lt;=&gt; ai.openai_embed('text-embedding-3-small', 'How is Timescale used in IoT?') as distance
FROM blog_embeddings
ORDER BY distance
LIMIT 3;</code></pre><p>Vectorizer runs automatically every five minutes on <a href="https://console.cloud.timescale.com/signup?ref=timescale.com"><u>Tiger Cloud</u></a>, handling retries and keeping your embeddings up to date. For more details on setup and advanced features like <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md#monitor-a-vectorizer"><u>monitoring the Vectorizer</u></a>, see our pgai Vectorizer <a href="https://github.com/timescale/pgai/blob/main/docs/vectorizer.md"><u>documentation</u></a>.</p><h3 id="further-reading-on-rag">Further Reading on RAG</h3><p>The accuracy and cost of your RAG application depend heavily on implementation choices, from embedding model selection to chunking strategy.</p><p>Here are more blog posts to help you build effective RAG applications with PostgreSQL:</p><ol><li><a href="https://timescale.ghost.io/blog/vector-databases-are-the-wrong-abstraction?ref=timescale.com"><u>Vector Databases Are the Wrong Abstraction</u></a> – Learn why general-purpose databases with vector extensions like <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>pgvectorscale</u></a> often provide better solutions than specialized vector databases</li><li><a href="https://timescale.ghost.io/blog/which-rag-chunking-and-formatting-strategy-is-best?ref=timescale.com"><u>Which RAG Chunking and Formatting Strategy Is Best?</u></a> – Explore different approaches to chunking and formatting your content for optimal retrieval-augmented generation (RAG) performance</li><li><a href="https://www.tigerdata.com/blog/which-openai-embedding-model-is-best" rel="noreferrer"><u>Which OpenAI Embedding Model Is Best?</u></a> – Compare OpenAI's embedding models to choose the right one for your use case</li></ol><h2 id="part-2-store-embeddings-in-a-postgresql-vector-database-using-pgvector">Part 2: Store Embeddings in a PostgreSQL Vector Database Using Pgvector</h2><p>Now that we 
have created embedding vectors for our blog content, the next step is to store them in a vector database that supports fast similarity search across many vectors.</p><h3 id="21-create-a-postgresql-database-and-install-pgvector">2.1 Create a PostgreSQL database and <a href="https://www.tigerdata.com/learn/postgresql-extensions-pgvector" rel="noreferrer">install pgvector</a></h3><p>First, we’ll create a PostgreSQL database. You can <a href="https://docs.timescale.com/getting-started/latest/services/" rel="noreferrer">create a cloud PostgreSQL database</a> in minutes for free on <a href="https://console.cloud.timescale.com/signup">Tiger Cloud</a> or use a local PostgreSQL database for this step. </p><p>Once you’ve created your PostgreSQL database, export your connection string as an environment variable, and just like the OpenAI API key, we’ll read it into our Python program from the environment:</p><pre><code class="language-Python"># Timescale database connection string
# Found under "Service URL" of the credential cheat-sheet or "Connection Info" in the Timescale console
# In terminal, run: export TIMESCALE_CONNECTION_STRING=postgres://&lt;fill in here&gt;

connection_string = os.environ['TIMESCALE_CONNECTION_STRING']

</code></pre>
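<p>A small defensive touch (our own addition, not from the original notebook): wrapping the lookup so a missing variable fails with a helpful hint instead of a bare <code>KeyError</code>:</p>

```python
import os

def get_connection_string(var_name="TIMESCALE_CONNECTION_STRING"):
    """Read the database URL from the environment, failing with a clear hint."""
    value = os.environ.get(var_name)
    if value is None:
        raise RuntimeError(
            f"{var_name} is not set; run: "
            f"export {var_name}=postgres://user:password@host:port/dbname"
        )
    return value
```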
<p>We then connect to our database using the popular <a href="https://pypi.org/project/psycopg2/?ref=timescale.com"><u>psycopg2</u></a> Python library and install the pgvector and <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#installation"><u>pgvectorscale</u></a> extensions (the latter provides powerful filtering and indexing capabilities) as follows:</p><pre><code class="language-Python"># Connect to PostgreSQL database in Timescale using connection string
conn = psycopg2.connect(connection_string)
cur = conn.cursor()

# Install pgvector
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()

# Install pgvectorscale
cur.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;")
conn.commit()
</code></pre>
<h3 id="22-connect-to-and-configure-your-vector-database">2.2 Connect to and configure your vector database</h3><p>Once we’ve installed pgvector, we use the <a href="https://github.com/pgvector/pgvector-python#psycopg-2">register_vector()</a> command to register the vector type with our connection:</p><pre><code class="language-Python"># Register the vector type with psycopg2
register_vector(conn)
</code></pre>
<p>Once we’ve connected to the database, let’s create a table that we’ll use to store embeddings along with metadata. Our table will look as follows:<br></p>
<!--kg-card-begin: html-->
<table><tbody><tr><td>id</td><td>title</td><td>url</td><td>content</td><td>tokens</td><td>embedding</td></tr></tbody></table>
<!--kg-card-end: html-->
<p></p><ul><li><code>id</code> represents the unique ID of each <a href="https://www.tigerdata.com/blog/a-beginners-guide-to-vector-embeddings" rel="noreferrer">vector embedding</a> in the table.</li><li><code>title</code> is the blog title from which the content associated with the embedding is taken.</li><li><code>url</code> is the blog URL from which the content associated with the embedding is taken.</li><li><code>content</code> is the actual blog content associated with the embedding.</li><li><code>tokens</code> is the number of tokens the embedding represents.</li><li><code>embedding</code> is the vector representation of the content.<br></li></ul><p>One advantage of using PostgreSQL as a vector database is that you can easily store metadata and embedding vectors in the same database, which is helpful for supplying the user with relevant information related to the response they receive, like links to read more or specific parts of a blog post that are relevant to them.</p><pre><code class="language-Python"># Create table to store embeddings and metadata
table_create_command = """
CREATE TABLE embeddings (
            id bigserial primary key, 
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
cur.close()
conn.commit()

</code></pre>
<h3 id="23-ingest-and-store-vector-data-into-postgresql-using-pgvector">2.3 Ingest and <a href="https://www.tigerdata.com/learn/vector-store-vs-vector-database" rel="noreferrer">store vector</a> data into PostgreSQL using pgvector</h3><p>Now that we’ve created the database and the table to house the embeddings and metadata, the final step is to insert the embedding vectors into the database. </p><p>For this step, it’s a best practice to batch insert the embeddings rather than insert them one by one.<br></p><pre><code class="language-Python"># Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()

</code></pre>
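<p>For very large DataFrames you may not want to build and send one giant tuple list in a single round trip. <code>execute_values</code> accepts a <code>page_size</code> argument, and you can also pre-chunk the rows yourself with a plain generator. A minimal sketch using only the standard library (committing per chunk is a design choice here, not a requirement):</p>

```python
def batched(rows, size):
    """Yield successive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

# Hypothetical usage with the cursor and data_list from above:
# for chunk in batched(data_list, 1000):
#     execute_values(cur,
#         "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s",
#         chunk)
#     conn.commit()
```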
<p>Let’s sanity check by running some simple queries against our newly inserted data:</p><pre><code class="language-Python">cur.execute("SELECT COUNT(*) as cnt FROM embeddings;")
num_records = cur.fetchone()[0]
print("Number of vector records in table: ", num_records,"\n")
# Correct output should be 129

</code></pre>
<pre><code class="language-Python"># print the first record in the table, for sanity-checking
cur.execute("SELECT * FROM embeddings LIMIT 1;")
records = cur.fetchall()
print("First record in table: ", records)
</code></pre>
<h3 id="24-index-your-data-for-faster-retrieval">2.4 Index your data for faster retrieval</h3><p>In this example, we only have 129 embedding vectors, so searching through all of them is blazingly fast. But for larger datasets, you need to create indexes to speed up searching for similar embeddings, so we include the code to build the index for illustrative purposes. </p><p>While pgvector&nbsp;supports the <a href="https://www.tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector-and-how-do-they-work" rel="noreferrer"><u>IVFFLAT</u></a> and <a href="https://www.tigerdata.com/learn/vector-database-basics-hnsw" rel="noreferrer"><u>HNSW</u></a> index types for approximate nearest neighbor (ANN) search, <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> offers a more cost-efficient and powerful index type for pgvector data: <a href="https://timescale.ghost.io/blog/pgvector-is-now-as-fast-as-pinecone-at-75-less-cost/"><u>StreamingDiskANN</u></a>, which we use here. </p><p>You always want to build this index <strong>after</strong> you have inserted the data, as the index needs to discover clusters in your data to be effective, and it does this only when first building the index. </p><p>The StreamingDiskANN index has tunable parameters depending on your goal, whether it is changing indexing operations or querying operations. In our case, we use the default values of the parameters. You can read more about <a href="https://github.com/timescale/pgvectorscale?tab=readme-ov-file#tuning"><u>tuning here</u></a>.</p><pre><code class="language-Python"># Create an index on the data for faster retrieval
cur.execute('CREATE INDEX embedding_idx ON embeddings USING diskann (embedding);')
conn.commit()
</code></pre>
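<p>If the defaults don’t hit your recall/latency targets, build-time options can be passed in a <code>WITH</code> clause on the index. The parameter names below (<code>num_neighbors</code>, <code>search_list_size</code>) are taken from the pgvectorscale README and may vary by version, so treat this as an illustrative sketch rather than a verified configuration:</p>

```python
def diskann_index_sql(table, column, num_neighbors=50, search_list_size=100):
    """Build a CREATE INDEX statement with explicit StreamingDiskANN options."""
    return (
        f"CREATE INDEX {table}_{column}_diskann_idx ON {table} "
        f"USING diskann ({column}) "
        f"WITH (num_neighbors = {num_neighbors}, search_list_size = {search_list_size});"
    )

# cur.execute(diskann_index_sql("embeddings", "embedding"))
# conn.commit()
```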
<h2 id="part-3-nearest-neighbor-search-using-pgvector">Part 3: Nearest Neighbor Search Using pgvector</h2><p>Given a user question, we’ll perform the following steps to use information stored in the vector database to answer their question using Retrieval Augmented Generation:</p><ol><li>Create an embedding vector for the user question.</li><li>Use pgvector to perform a vector similarity search and retrieve the <code>k</code> nearest neighbors to the question embedding from our embedding vectors representing the blog content. In our example, we’ll use k=3, finding the three most similar embedding vectors and associated content.</li><li>Supply the content retrieved from the database as additional context to the model and ask it to perform a completion task to answer the user question.</li></ol><h3 id="31-define-a-question-you-want-to-answer">3.1 Define a question you want to answer</h3><p>First, we’ll define a sample question that a user might want to answer about the blog posts stored in the database.</p><pre><code class="language-Python"># Question about Timescale we want the model to answer
input = "How is Timescale used in IoT?"
</code></pre>
<p>Since TimescaleDB is <a href="https://timescale.ghost.io/blog/visualizing-iot-data-at-scale-with-hopara-and-timescaledb/">popular for IoT sensor data</a>, a user might want to learn specifics about how they can leverage it for that use case.</p><h3 id="32-find-the-most-relevant-content-in-the-database">3.2 Find the most relevant content in the database</h3><p>Here’s the function we use to find the three nearest neighbors to the user question. Note it uses pgvector’s <code>&lt;=&gt;</code> operator, which computes the <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine distance</a> (one minus the <a href="https://www.tigerdata.com/learn/understanding-cosine-similarity" rel="noreferrer">cosine similarity</a>) between two embedding vectors.</p><pre><code class="language-Python"># Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN &lt;=&gt; operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding &lt;=&gt; %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs

</code></pre>
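<p>To make the ordering concrete: cosine distance equals one minus cosine similarity, so the nearest neighbors are the rows with the <em>smallest</em> distance value. The arithmetic behind pgvector’s <code>&lt;=&gt;</code> operator can be reproduced in NumPy:</p>

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (what pgvector's operator orders by)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

print(cosine_distance([1, 0], [2, 0]))  # identical direction: 0.0
print(cosine_distance([1, 0], [0, 1]))  # orthogonal: 1.0
```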
<h3 id="33-define-helper-functions-to-query-openai">3.3 Define helper functions to query OpenAI</h3><p>We define a helper function to get a completion response from an OpenAI model, and we reuse the previously defined helper function, <code>get_embeddings</code>, to create an embedding for the user question. We use GPT-4o, but you can use any other model from OpenAI.</p><p>We also specify a few parameters you can adjust to your liking: <code>max_tokens</code>, which caps the length of the model response, and <code>temperature</code>, which controls the randomness of the model’s output:</p><pre><code class="language-Python"># Helper function: get text completion from OpenAI API
# We default to the gpt-4o model, but you can pass in any OpenAI chat model
def get_completion_from_messages(messages, model="gpt-4o", temperature=0, max_tokens=1000):
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature, 
        max_tokens=max_tokens, 
    )
    return response.choices[0].message.content
</code></pre>
<h3 id="33-putting-it-all-together">3.4 Putting it all together</h3><p>We’ll define a function to process the user input by retrieving the most similar documents from our database and passing the user input, along with the relevant retrieved context, to the OpenAI model to generate a completion response.</p><p>We also modify the system prompt to influence the tone of the model’s response.</p><p>We pass the content associated with the three embeddings most similar to the user input to the model using the assistant role. You can also append the additional context to the user message.<br></p><pre><code class="language-Python"># Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"

    #Step 1: Get documents related to the user input from database
    related_docs = get_top3_similar_docs(get_embeddings(user_input), conn)

    # Step 2: Get completion from OpenAI API
    # Set system message to help set appropriate tone and context for model
    system_message = f"""
    You are a friendly chatbot. \
    You can answer questions about timescaledb, its features and its use cases. \
    You respond in a concise, technically credible tone. \
    """

    # Prepare messages to pass to model
    # We use a delimiter to help the model understand where the user_input starts and ends
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": f"Relevant Timescale case studies information: \n {related_docs[0][0]} \n {related_docs[1][0]} \n {related_docs[2][0]}"}
    ]

    final_response = get_completion_from_messages(messages)
    return final_response
</code></pre>
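<p>If you want to experiment with where the retrieved context goes (appended to the user message instead of an assistant turn, as mentioned above), it helps to isolate the message-assembly step. A hypothetical refactor; the helper name and shape are ours, not part of the OpenAI library:</p>

```python
def build_messages(system_message, user_input, docs, context_in_user=False):
    """Assemble chat messages, placing retrieved docs in either an assistant
    turn or at the end of the user turn."""
    delimiter = "`" * 3  # same triple-backtick delimiter as above
    context = "\n".join(docs)
    user_content = f"{delimiter}{user_input}{delimiter}"
    if context_in_user:
        return [
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"{user_content}\nRelevant context:\n{context}"},
        ]
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": f"Relevant context:\n{context}"},
    ]
```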
<p>Let’s see an example of the model’s output to our original input question:</p><pre><code class="language-Python">input = "How is Timescale used in IoT?"
response = process_input_with_retrieval(input)
print(input)
print(response)
</code></pre>
<p>Here’s the model’s response:</p><pre><code>User input: How is Timescale used in IoT?

Model Response:

TimescaleDB is widely used in IoT applications due to its ability to efficiently handle time-series data, which is a common requirement in IoT systems. Here are some ways TimescaleDB is utilized in IoT:

1. **Data Ingestion and Storage**: IoT devices generate large volumes of time-stamped data. TimescaleDB's architecture is optimized for ingesting and storing this data efficiently, allowing for high write throughput and efficient storage through features like data compression.

2. **Real-time Analytics**: TimescaleDB supports real-time analytics, enabling IoT applications to process and analyze data as it arrives. This is crucial for applications that require immediate insights, such as monitoring systems or predictive maintenance.

3. **Historical Data Analysis**: With its support for continuous aggregates, TimescaleDB allows for efficient querying of historical data. This is useful for trend analysis, reporting, and long-term data storage, which are common in IoT use cases.

4. **Integration with IoT Platforms**: TimescaleDB can be integrated with various IoT platforms and tools, such as AWS IoT, MQTT, and Grafana, to manage, visualize, and analyze IoT data seamlessly.

5. **Scalability**: As IoT deployments grow, the volume of data can increase significantly. TimescaleDB's scalability features, such as partitioning and distributed hypertables, help manage this growth effectively.

Overall, TimescaleDB provides a robust solution for managing the unique challenges of IoT data, combining the benefits of a relational database with time-series capabilities.

</code></pre><p>We can also ask the model questions about specific documents in the database, in this case about specific TimescaleDB users who have spoken about their experience in a blog post:</p><pre><code class="language-Python">input_2 = "Tell me about Edeva and Hopara. How do they use Timescale?"
response_2 = process_input_with_retrieval(input_2)
print(input_2)
print(response_2)

</code></pre>
<p>Here’s the model output for the input question above; notice how it uses specific details from the blog posts.</p><pre><code>User input: Tell me about Edeva and Hopara. How do they use Timescale?


Model Response:

### Edeva

**Use Case:**
Edeva leverages TimescaleDB to manage and analyze data from their smart city systems, particularly focusing on their Actibump dynamic speed bump. The data collected includes:

1. **IoT Event Data:** Information such as vehicle speeds and environmental conditions.
2. **Alarm Data:** Alerts for sensor malfunctions or other issues.
3. **Status Data:** Self-checks and statistical data from IoT devices.
4. **Administrative Data:** Metadata about devices, such as configuration details.

**Key Features Utilized:**
- **Continuous Aggregations:** To speed up queries and make dashboards responsive.
- **Percentile Aggregations:** For calculating accurate percentile values without querying raw data.
- **SQL Compatibility:** Simplifies onboarding for developers familiar with SQL.

**Benefits:**
- **Performance:** Transitioned from sluggish to lightning-fast dashboards.
- **Ease of Use:** Developers could quickly adapt due to SQL familiarity.
- **Scalability:** Efficiently handles large datasets, such as hundreds of millions of records.

### Hopara

**Use Case:**
Hopara uses TimescaleDB to manage and visualize time-series data for their geospatial analytics platform. The platform integrates various data sources to provide insights into spatial and temporal trends.

**Key Features Utilized:**
- **Time-Series Data Management:** Efficiently stores and queries large volumes of time-series data.
- **Geospatial Capabilities:** Leverages PostgreSQL’s PostGIS extension for spatial queries.
- **Continuous Aggregations:** To pre-compute and speed up complex queries.

**Benefits:**
- **Scalability:** Handles large datasets with ease.
- **Performance:** Fast query execution for real-time analytics.
- **Integration:** Seamless integration with existing PostgreSQL tools and extensions.

Both Edeva and Hopara benefit from TimescaleDB’s ability to handle large volumes of time-series data efficiently, providing fast query performance and ease of use through SQL compatibility.
</code></pre><h2 id="conclusion">Conclusion</h2><p><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search" rel="noreferrer">Retrieval-augmented generation (RAG)</a> is a powerful method of building applications with LLMs that enables you to teach foundation models about things they were not originally trained on, like private documents or recently published information.</p><p>This project is an example of how to create, store, and perform similarity search on <a href="https://www.tigerdata.com/blog/open-source-vs-openai-embeddings-for-rag" rel="noreferrer">OpenAI embeddings</a>. We used PostgreSQL + <a href="https://github.com/pgvector/pgvector"><u>pgvector</u></a> + <a href="https://github.com/timescale/pgvectorscale"><u>pgvectorscale</u></a> as our vector database to efficiently store and query the embeddings, enabling precise and relevant responses.</p><h2 id="timescaledb-postgresql">TimescaleDB + PostgreSQL</h2><p>And if you’re looking for a production PostgreSQL database for your vector workloads, <a href="https://console.cloud.timescale.com/signup"><u>try Timescale</u></a>. 
It’s free for 30 days, no credit card required.</p><h3 id="further-reading">Further reading</h3><p>Here are more blog posts about RAG with PostgreSQL and different tools:</p><ul><li><a href="https://www.timescale.com/blog/rag-is-more-than-just-vector-search/"><u>RAG Is More Than Just Vector Search</u></a></li><li><a href="https://timescale.com/blog/retrieval-augmented-generation-with-claude-sonnet-3-5-and-pgvector/"><u>Retrieval-Augmented Generation With Claude Sonnet 3.5 &amp; Pgvector</u></a></li><li><a href="https://www.timescale.com/blog/build-a-fully-local-rag-app-with-postgresql-mistral-and-ollama/"><u>Build a Fully Local RAG App With PostgreSQL, Mistral, and Ollama</u></a></li><li><a href="https://www.timescale.com/blog/building-an-ai-image-gallery-advanced-rag-with-pgvector-and-claude-sonnet-3-5/"><u>Building an AI Image Gallery: Advanced RAG With Pgvector and Claude Sonnet 3.5</u></a></li></ul><p></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Write Better Queries for Time-Series Data Analysis With Custom SQL Functions]]></title>
            <description><![CDATA[There is a better way to explore and analyze time-series data with PostgreSQL: hyperfunctions. Learn how to ease your workload with faster and simpler data analysis using TimescaleDB.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Hyperfunctions]]></category>
            <category><![CDATA[Engineering]]></category>
            <dc:creator><![CDATA[JF Joly]]></dc:creator>
            <pubDate>Thu, 23 Jun 2022 13:02:54 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2022/06/marc-olivier-jodoin-NqOInJ-ttqM-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<h2 id="why-the-right-tools-matter-when-analyzing-time-series-data">Why the Right Tools Matter When Analyzing Time-Series Data</h2><p><strong>SQL is the lingua franca for analytics. </strong>As data proliferates, we need to find new ways to store, explore, and analyze it. <a href="https://timescale.ghost.io/blog/blog/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a/">We believe SQL is the best language for data analysis</a>. We’ve championed the benefits of SQL for several years, even when many were swapping it for custom domain-specific languages. Full SQL support was one of the <a href="https://timescale.ghost.io/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">key reasons</a> we chose to build TimescaleDB on top of PostgreSQL, the <a href="https://survey.stackoverflow.co/2022/#most-popular-technologies-database-prof">most loved database among developers</a>, rather than creating a custom query language. And we were right—SQL is making a comeback (although it never really went away) and has become the universal language for data analysis, with many NoSQL databases adding SQL interfaces to keep up.</p><p>In addition, most developers are familiar with SQL, along with most data scientists, data analysts, and other professionals who work with data. Whether you've taken classes at university, done an online course, or attended a boot camp, chances are that you probably have learned a bit of SQL along the way. So you and your fellow developers already know it, making it easier for teams to onboard new members and quickly extract value from the data. 
With a proprietary language, learning the language is in itself a barrier to using the data—you’ll have to ask another team to write the queries or rely on a separate data lake.</p><p><a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/"><strong>Time-series data</strong></a><strong> is ubiquitous. </strong>At Timescale, our mission is to serve developers worldwide and enable them to build exceptional data-driven products that measure everything that matters: software applications, industrial equipment, financial markets, blockchain activity, user actions, consumer behavior, machine learning models, climate change, and more.</p><p>And time-series data comes at you fast, sometimes generating millions of data points per second. Because of the sheer volume and rate of information, time-series data can be complex to query and analyze, even in SQL.</p><p><strong>TimescaleDB hyperfunctions make it easier to manipulate and analyze time-series datasets with fewer lines of SQL code</strong>. Hyperfunctions are purpose-built for the most common and difficult time-series and analytical queries developers write today in SQL. Using hyperfunctions makes you more productive when querying time-series data, which means you can spend less time creating reports, dashboards, and visualizations involving <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>, and spend more time acting on the insights that your work unearths!</p><h2 id="handling-time-series-data-meet-hyperfunctions">Handling Time-Series Data: Meet Hyperfunctions</h2><p>There are over 70 different TimescaleDB hyperfunctions ready to use today. 
Here are some of the most popular ones and how they can help you handle your time-series data:<br></p><ul><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#solved-how-to-query-arbitrary-time-intervals-with-date_trunc"><strong>Time-based analysis</strong></a><strong>:</strong> <code>time_bucket()</code> makes time-based analysis simpler and easier by enabling you to analyze data over arbitrary time intervals using succinct queries.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#first-and-last-"><strong><code>first()</code> and <code>last()</code></strong></a> allow you to get the value of one column as ordered by another (2x faster in TimescaleDB 2.7!).</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#simpler-time-weighted-averages"><strong>Time-weighted averages</strong></a><strong>:</strong> <code>time_weight()</code> and related hyperfunctions for working with time-weighted averages offer a more elegant way to get an unbiased average when working with irregularly sampled data.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#enhanced-query-readability-and-maintenance-with-function-pipelines-"><strong>Function pipelines</strong></a> enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL (currently experimental).</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#better-data-summaries-using-percentile-approximation"><strong>Percentile approximation</strong></a> brings percentile analysis to more workflows, enabling you to understand the distribution of your data 
efficiently (e.g., 10th percentile, median/50th percentile, 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets. When used with <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">continuous aggregates</a>, you can compute percentiles over any time range of your dataset in near real-time and use them for baselining and normalizing incoming data.</li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#easier-frequency-analysis-with-frequency-aggregates"><strong>Frequency analysis</strong></a><strong>:</strong> <code>freq_agg()</code> and related frequency analysis hyperfunctions find the most common elements in a set of vastly more varied values more efficiently than a brute-force calculation.</li><li><a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/"><strong>Histogram</strong></a> shows the data distribution and can offer a better understanding of the segments compared to an average (<a href="https://statisticsbyjim.com/basics/histograms/">more on histograms</a>).</li><li><a href="https://timescale.ghost.io/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/"><strong>Downsampling</strong></a><strong>: </strong>ASAP smoothing smooths datasets to highlight the most important features when graphed. Largest Triangle Three Buckets Downsampling or <code>lttb()</code> reduces the number of elements in a dataset while retaining important features when graphed.
<a href="https://timescale.ghost.io/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/">See how to apply our downsampling hyperfunctions in Grafana.</a></li><li><a href="https://timescale.ghost.io/blog/how-to-write-better-queries-for-time-series-data-analysis-using-custom-sql-functions/#more-memory-efficient-count-distinct-queries"><strong>Memory efficient COUNT DISTINCTs</strong></a>: HyperLogLog is a probabilistic cardinality estimator that uses significantly less memory than the equivalent COUNT DISTINCT query. It is ideal for use in a continuous aggregate for large datasets.</li></ul><p>We created new SQL functions for each of these time-series analysis and manipulation capabilities. This contrasts with other efforts to improve the developer experience by introducing new SQL syntax. While introducing new syntax with new keywords and constructs may have been easier from an implementation perspective, we made the deliberate decision not to do so since we believe it leads to a worse experience for the end-user. </p><p>New SQL syntax means existing drivers, libraries, and tools may no longer work. That can leave developers with more problems than solutions as their favorite tools, libraries, or drivers may not support the new syntax or require time-consuming modifications. On the other hand, new SQL functions mean that your query will run in every visualization tool, database admin tool, or data analysis tool. 
</p><p>We have the freedom to create custom functions, aggregates, and procedures that help developers better understand and work with their data, and ensure all their drivers and interfaces still work as expected!<br><br>We will now dive into each hyperfunction category that we mentioned and give examples of when, why, and how to use them, plus resources to continue your learning.</p><p>TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. <a href="https://console.cloud.timescale.com/">Get started with a free Timescale trial</a>—no credit card required. Or <a href="https://docs.timescale.com/install/latest/">download for free</a> with TimescaleDB self-managed.</p><p>If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/#analyze-data-using-timescaledb-continuous-aggregates-and-hyperfunctions">start our tutorial</a>, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).  </p><p>Can’t find the function you need?<strong> </strong>Open an issue on our <a href="https://github.com/timescale/timescaledb-toolkit/issues">GitHub project</a> or contact us on <a href="https://timescaledb.slack.com/">Slack</a> or via the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We love to work with our users to simplify SQL!</p><h2 id="solved-how-to-query-arbitrary-time-intervals-with-datetrunc">Solved: How to Query Arbitrary Time-Intervals With date_trunc</h2><p>When using PostgreSQL, the <a href="https://www.postgresql.org/docs/current/functions-datetime.html"><code>date_trunc</code></a><a href="https://www.postgresql.org/docs/current/functions-datetime.html"> function</a> can be useful when you want to aggregate information over an interval of time. 
<code>date_trunc</code> truncates a <code>TIMESTAMP</code> or an <code>INTERVAL</code> value based on a specified date part (e.g., hour, week, or month) and returns the truncated timestamp or interval. For example, <code>date_trunc</code> can aggregate by one second, one hour, one day, or one week. However, you often want to see aggregates by the time intervals that matter most to your use case, which may be intervals like 30 seconds, 5 minutes, 12 hours, etc. This can get pretty complicated in SQL; just look at the query below, which analyzes taxi ride activity in five-minute time intervals:</p><p><strong>Regular PostgreSQL: Taxi rides taken every five minutes</strong></p><pre><code class="language-SQL">SELECT
  EXTRACT(hour from pickup_datetime) as hours,
  trunc(EXTRACT(minute from pickup_datetime) / 5)*5 AS five_mins,
  COUNT(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY hours, five_mins;
</code></pre>
<p><strong>The <code>time_bucket()</code> hyperfunction makes it easy to query your data in whatever time interval is most relevant to your analysis use case</strong>. <code>time_bucket()</code> enables you to aggregate data by arbitrary time intervals (e.g., 10 seconds, 5 minutes, 6 hours, etc.), and gives you flexible groupings and offsets, instead of just second, minute, hour, and so on.</p><p>In addition to allowing more flexible time-series queries, <code>time_bucket()</code> also allows you to write these queries in a simpler way. Just look at how much simpler the query from the example above is to write and understand when using the <code>time_bucket()</code> hyperfunction:</p><p><strong>TimescaleDB hyperfunctions: Taxi rides taken every five minutes</strong></p><pre><code class="language-SQL">-- How many rides took place every 5 minutes for the first day of 2016?
SELECT time_bucket('5 minute', pickup_datetime) AS five_min, count(*)
FROM rides
WHERE pickup_datetime &lt; '2016-01-02 00:00'
GROUP BY five_min
ORDER BY five_min;
</code></pre>
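<p>The optional third argument of <code>time_bucket()</code> lets you shift bucket boundaries by an offset. The query below is a sketch (against the same <code>rides</code> table) that aligns five-minute buckets to start 2.5 minutes past the default boundary:</p><pre><code class="language-SQL">-- five-minute buckets shifted by a 2.5-minute offset
SELECT time_bucket('5 minutes', pickup_datetime, '2.5 minutes'::interval) AS five_min,
  count(*)
FROM rides
GROUP BY five_min
ORDER BY five_min;
</code></pre>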
<p>If you’d like even more flexibility when aggregating your data, you can test out the  hyperfunction <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket</code></a>, which is an updated version of the original <code>time_bucket()</code> hyperfunction. <code>time_bucket</code> enables you to bucket your data by years and months, in addition to second, minute, and hour time intervals. This allows you to easily do monthly cohort analysis or other multiple-month-based reports in SQL.</p><pre><code class="language-SQL">SELECT time_bucket('3 month', date '2021-08-01');
 time_bucket
----------------
 2021-07-01
(1 row)
</code></pre>
<p><code>time_bucket</code> also features custom timezone support, which enables you to write queries like the one below, which illustrates using it to bucket data in the Europe/Moscow region:</p><pre><code class="language-SQL">-- note that timestamptz is displayed differently depending on the session parameters
SET TIME ZONE 'Europe/Moscow';

SELECT time_bucket('1 month', timestamptz '2001-02-03 12:34:56 MSK', timezone =&gt; 'Europe/Moscow');
     time_bucket
------------------------
 2001-02-01 00:00:00+03
</code></pre>
<p>Missing data, or gaps, are a common occurrence when capturing hundreds or thousands of time-series readings per second or minute. Gaps can appear because of irregular sampling intervals or because you have experienced an outage of some sort. </p><p>The <code>time_bucket_gapfill()</code> hyperfunction enables you to create additional rows of data in any gaps, ensuring that the returned rows are in chronological order and contiguous. To learn more about gappy data, read our blog <a href="https://timescale.ghost.io/blog/sql-functions-for-time-series-analysis/">Mind the Gap: Using SQL Functions for Time-Series Analysis</a>.</p><p>Here’s an example of <code>time_bucket_gapfill()</code> in action, where we find the daily average temperature for a certain device and use the <code>locf()</code> function to carry the last observation forward in case we have gaps in our data:</p><pre><code class="language-SQL">SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  locf(avg(temperature))
FROM metrics
WHERE time &gt; now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;

           day          | device_id | value | locf
------------------------+-----------+-------+------
 2019-01-10 01:00:00+01 |         1 |       |
 2019-01-11 01:00:00+01 |         1 |   5.0 |  5.0
 2019-01-12 01:00:00+01 |         1 |       |  5.0
 2019-01-13 01:00:00+01 |         1 |   7.0 |  7.0
 2019-01-14 01:00:00+01 |         1 |       |  7.0
 2019-01-15 01:00:00+01 |         1 |   8.0 |  8.0
 2019-01-16 01:00:00+01 |         1 |   9.0 |  9.0
(7 rows)
</code></pre>
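<p>If carrying the last observation forward isn’t the right fit for your data, gaps can instead be filled by linear interpolation between the surrounding readings. Here’s a sketch using the <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/"><code>interpolate()</code></a> function on the same <code>metrics</code> table:</p><pre><code class="language-SQL">SELECT
  time_bucket_gapfill('1 day', time, now() - INTERVAL '1 week', now()) AS day,
  device_id,
  avg(temperature) AS value,
  interpolate(avg(temperature))
FROM metrics
WHERE time &gt; now() - INTERVAL '1 week'
GROUP BY day, device_id
ORDER BY day;
</code></pre>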
<p>The last observation carried forward or <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/"><code>locf(</code></a><code>)</code> function allows you to carry forward the last seen value in an aggregation group. You can only use it in an aggregation query with <code>time_bucket_gapfill</code>.</p><p>To learn more about using the <code>time_bucket</code> family of hyperfunctions, read the <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/">docs</a>, and get started with our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/">tutorial</a>, which uses <code>time_bucket()</code> to analyze a real-world IoT dataset.</p><h2 id="simpler-time-weighted-averages">Simpler Time-Weighted Averages</h2><p>If you’re in a situation where you don't have regularly sampled data, getting a representative average over a period of time can be a complex and time-consuming query to write. For example, irregularly sampled data, and thus the need for time-weighted averages, frequently occurs in the following cases:</p><ul><li>Industrial IoT, where teams “compress” data by only sending points when the value changes.</li><li>Remote sensing, where sending data back from the edge can be costly, so you only send high-frequency data for the most critical operations.</li><li>Trigger-based systems, where the sampling rate of one sensor is affected by the reading of another (i.e., a security system that sends data more frequently when a motion sensor is triggered).</li></ul><p>Time-weighted averages are a way to get an unbiased average when you are working with irregularly sampled data.<br><br>To illustrate the value of a hyperfunction to find time-weighted averages, consider the following example of a simple table modeling freezer temperature:</p><pre><code class="language-SQL">CREATE TABLE freezer_temps (
	freezer_id int,
	ts timestamptz,
	temperature float);
</code></pre>
<p>And some irregularly sampled time-series data representing the freezer temperature:</p><pre><code class="language-SQL">INSERT INTO freezer_temps VALUES 
( 1, '2020-01-01 00:00:00+00', 4.0), 
( 1, '2020-01-01 00:05:00+00', 5.5), 
( 1, '2020-01-01 00:10:00+00', 3.0), 
( 1, '2020-01-01 00:15:00+00', 4.0), 
( 1, '2020-01-01 00:20:00+00', 3.5), 
( 1, '2020-01-01 00:25:00+00', 8.0), 
( 1, '2020-01-01 00:30:00+00', 9.0), 
( 1, '2020-01-01 00:31:00+00', 10.5), -- door opened!
( 1, '2020-01-01 00:31:30+00', 11.0), 
( 1, '2020-01-01 00:32:00+00', 15.0), 
( 1, '2020-01-01 00:32:30+00', 20.0), -- door closed
( 1, '2020-01-01 00:33:00+00', 18.5), 
( 1, '2020-01-01 00:33:30+00', 17.0), 
( 1, '2020-01-01 00:34:00+00', 15.5), 
( 1, '2020-01-01 00:34:30+00', 14.0), 
( 1, '2020-01-01 00:35:00+00', 12.5), 
( 1, '2020-01-01 00:35:30+00', 11.0), 
( 1, '2020-01-01 00:36:00+00', 10.0), -- temperature stabilized
( 1, '2020-01-01 00:40:00+00', 7.0),
( 1, '2020-01-01 00:45:00+00', 5.0);

</code></pre>
<p>Calculating the time-weighted average temperature of the freezer using regular SQL functions would look something like this:</p><p><strong>Time-weighted averages using regular SQL</strong></p><pre><code class="language-SQL">WITH setup AS (
	SELECT lag(temperature) OVER (PARTITION BY freezer_id ORDER BY ts) as prev_temp, 
		extract('epoch' FROM ts) as ts_e, 
		extract('epoch' FROM lag(ts) OVER (PARTITION BY freezer_id ORDER BY ts)) as prev_ts_e, 
		* 
	FROM  freezer_temps), 
nextstep AS (
	SELECT CASE WHEN prev_temp is NULL THEN NULL 
		ELSE (prev_temp + temperature) / 2 * (ts_e - prev_ts_e) END as weighted_sum, 
		* 
	FROM setup)
SELECT freezer_id,
    avg(temperature), -- the regular average
	sum(weighted_sum) / (max(ts_e) - min(ts_e)) as time_weighted_average 
FROM nextstep
GROUP BY freezer_id;
</code></pre>
<p>But, with the TimescaleDB <code>time_weight()</code> hyperfunction, we reduce this query, which is tedious to write and confusing to read, to a much simpler five-line query:<br></p><pre><code class="language-SQL">SELECT freezer_id, 
	avg(temperature), 
	average(time_weight('Linear', ts, temperature)) as time_weighted_average 
FROM freezer_temps
GROUP BY freezer_id;

freezer_id |  avg  | time_weighted_average 
------------+-------+-----------------------
          1 | 10.2  |     6.636111111111111
</code></pre>
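<p>Because <code>time_weight()</code> is an aggregate, it also composes naturally with <code>time_bucket()</code>. For example, here’s a sketch that computes a time-weighted average per 10-minute bucket for each freezer:</p><pre><code class="language-SQL">SELECT freezer_id,
	time_bucket('10 minutes', ts) AS bucket,
	average(time_weight('Linear', ts, temperature)) AS time_weighted_average
FROM freezer_temps
GROUP BY freezer_id, bucket
ORDER BY bucket;
</code></pre>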
<p>To learn more about using time-weighted average hyperfunctions, read the<a href="https://docs.timescale.com/api/latest/hyperfunctions/time-weighted-averages/"> docs</a> and see our explainer blog post: <a href="https://timescale.ghost.io/blog/what-time-weighted-averages-are-and-why-you-should-care/">What time-weighted averages are and why you should care</a>.</p><h2 id="better-data-summaries-using-percentile-approximation">Better Data Summaries Using Percentile Approximation</h2><p>Many developers choose to use averages and other summary statistics more frequently than percentiles because they are significantly “cheaper” to calculate over large time-series datasets, both in computational resources and time.</p><p>As we were designing hyperfunctions, we thought about how we could capture the benefits of percentiles (e.g., robustness to outliers, better correspondence with real-world impacts) while avoiding some of the pitfalls of calculating exact percentiles. </p><p>TimescaleDB’s<strong> percentile approximation hyperfunctions</strong> enable you to understand your data distribution efficiently (e.g., the 10th percentile, the 50th percentile or median, the 90th percentile, etc.) without performing expensive computations over gigantic time-series datasets.</p><p>With relatively large datasets, you can often accept some accuracy trade-offs to avoid running into issues of high memory footprint and network costs while enabling percentiles to be computed more efficiently in parallel and used on streaming data. (In this post, you can learn more about the <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">design decisions and trade-offs made in TimescaleDB’s percentile approximation hyperfunctions</a>.)</p><p>TimescaleDB has a whole family of <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentile approximation hyperfunctions</a>. 
The simplest way to call them is to use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/">percentile_agg aggregate</a> along with the<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/"> approx_percentile accessor</a>. For example, here’s how we might calculate the 10th, 50th (median), and 90th percentiles of the response time of a particular API:</p><pre><code class="language-SQL">SELECT 
    approx_percentile(0.1, percentile_agg(response_time)) as p10, 
    approx_percentile(0.5, percentile_agg(response_time)) as p50, 
    approx_percentile(0.9, percentile_agg(response_time)) as p90 
FROM responses;
</code></pre>
<p>Hyperfunctions for percentile approximation can also be used in <a href="https://timescale.ghost.io/blog/how-we-made-data-aggregation-better-and-faster-on-postgresql-with-timescaledb-2-7/">TimescaleDB’s continuous aggregates</a>, which make aggregate queries on very large datasets run faster. Continuous aggregates continuously and incrementally store the results of an aggregation query in the background. So, when you run the query, only the changed data needs to be computed, not the entire dataset.</p><p>That is a huge advantage compared to exact percentiles because you can now do things like baselining and alerting on longer periods without recalculating from scratch every time!</p><p>For example, here’s how you can use continuous aggregates to identify recent outliers and investigate potential problems. First, we create a one-hour aggregation from the <a href="https://www.tigerdata.com/blog/database-indexes-in-postgresql-and-timescale-cloud-your-questions-answered" rel="noreferrer">hypertable</a> <code>responses</code>:</p><pre><code class="language-SQL">CREATE TABLE responses(
	ts timestamptz, 
	response_time DOUBLE PRECISION);
SELECT create_hypertable('responses', 'ts');
</code></pre>
<pre><code class="language-SQL">CREATE MATERIALIZED VIEW responses_1h_agg
WITH (timescaledb.continuous)
AS SELECT 
    time_bucket('1 hour'::interval, ts) as bucket,
    percentile_agg(response_time)
FROM responses
GROUP BY time_bucket('1 hour'::interval, ts);
</code></pre>
<p>To find outliers, we can select data points from the last 30 seconds that exceed the 99th percentile:</p><pre><code class="language-SQL">SELECT * FROM responses 
WHERE ts &gt;= now()-'30s'::interval
AND response_time &gt; (
	SELECT approx_percentile(0.99, percentile_agg)
	FROM responses_1h_agg
	WHERE bucket = time_bucket('1 hour'::interval, now()-'1 hour'::interval)
);
</code></pre>
<p>To learn more about using percentile approximation hyperfunctions, read the <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">docs</a>, try our <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">tutorial using real-world NFL data</a> and see our <a href="https://timescale.ghost.io/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/">explainer blog post on why percentile approximation is more useful than averages</a>.</p><h2 id="first-and-last">first() and last()</h2><p>Another common problem is finding the first or last values for multiple time series. That often occurs in IoT scenarios, where you want to monitor devices in different locations, but each device sends back data at different times (as devices can go offline, experience connectivity issues, batch transmit data, or simply have different sampling rates).</p><p>The <code>last</code> hyperfunction allows you to get the value of one column as ordered by another. For example, <code>last(temperature, time)</code> returns the latest temperature value based on time within an aggregate group. </p><p>This way, you can more easily write queries that, for example, find the last recorded temperature at multiple locations, even when each location samples and records data at a different rate:</p><pre><code class="language-SQL">SELECT location, last(temperature, time)
  FROM conditions
  GROUP BY location;
</code></pre>
<p>Similarly, the <code>first</code> hyperfunction also allows you to get the value of one column as ordered by another. <code>first(temperature, time)</code> returns the earliest temperature value based on time within an aggregate group:</p><pre><code class="language-SQL">SELECT device_id, first(temp, time)
FROM metrics
GROUP BY device_id;
</code></pre>
<p><code>first()</code> and <code>last()</code> can also be used in more complex queries, such as finding the latest value within a specific time interval. In the example below, we find the last temperature recorded for each device in five-minute intervals throughout the past day:</p><pre><code class="language-SQL">SELECT device_id, time_bucket('5 minutes', time) AS interval,
  last(temp, time)
FROM metrics
WHERE time &gt; now() - INTERVAL '1 day'
GROUP BY device_id, interval
ORDER BY interval DESC;
</code></pre>
<p>In TimescaleDB 2.7, we’ve made <a href="https://github.com/timescale/timescaledb/pull/3943">improvements</a> to make queries with the <code>first()</code> and <code>last()</code> hyperfunctions up to twice as fast and make memory usage near constant.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> The last and first commands do not use indexes but perform a sequential scan through their groups. They are primarily used for ordered selection within a GROUP BY aggregate and not as an alternative to an ORDER BY time DESC LIMIT 1 clause to find the latest value (which uses indexes).</em></i></div></div><p>To learn more, see the docs for <a href="https://docs.timescale.com/api/latest/hyperfunctions/first/"><code>first(</code></a><code>)</code> and <a href="https://docs.timescale.com/api/latest/hyperfunctions/last/"><code>last(</code></a><code>)</code>.</p><h2 id="more-memory-efficient-count-distinct-queries">More Memory Efficient COUNT DISTINCT Queries<br></h2><p>Calculating the exact number of distinct values in a large dataset with <a href="https://www.tigerdata.com/learn/how-to-handle-high-cardinality-data-in-postgresql" rel="noreferrer">high cardinality</a> requires lots of computational resources, which can impact the query performance and experience of your database's concurrent users. </p><p>To solve this issue, TimescaleDB provides hyperfunctions to calculate <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/">approximate COUNT DISTINCTs</a>. Approximate count distincts do not calculate the exact cardinality of a dataset, but rather estimate the number of unique values, in order to improve compute time. 
We use <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a>, a probabilistic cardinality estimator that uses significantly less memory than the equivalent <code>COUNT DISTINCT</code> query.  </p><p><a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/hyperloglog/"><code>Hyperloglog(</code></a><code>)</code> is an approximation object for <code>COUNT DISTINCT</code> queries. And the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/distinct_count/"><code>distinct_count(</code></a><code>)</code> accessor function gets the number of distinct values from a HyperLogLog object, as illustrated in the example below, which efficiently estimates the number of unique NFTs and collections in a hypothetical NFT marketplace:<br></p><pre><code class="language-SQL">SELECT
  distinct_count(hyperloglog(32768, asset_id)) AS nft_count,
  distinct_count(hyperloglog(32768, collection_id)) AS collection_count
FROM nft_sales
WHERE payment_symbol = 'ETH' AND time &gt; NOW() - INTERVAL '3 months';
</code></pre>
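<p>Like other hyperfunctions, <code>hyperloglog()</code> states can be stored in a continuous aggregate and combined later with <code>rollup()</code>. The sketch below (assuming <code>nft_sales</code> is a hypertable with a <code>time</code> column) keeps a daily sketch and then rolls the sketches up for a whole-period estimate:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW nft_sales_daily
WITH (timescaledb.continuous)
AS SELECT
    time_bucket('1 day'::interval, time) AS bucket,
    hyperloglog(32768, asset_id) AS hll
FROM nft_sales
GROUP BY bucket;

-- estimate distinct assets across all days by combining the daily sketches
SELECT distinct_count(rollup(hll)) FROM nft_sales_daily;
</code></pre>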
<p>You can also use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/approx_count_distincts/stderror/"><code>std_error(</code></a><code>)</code> function to estimate the relative standard error of the HyperLogLog compared to running <code>COUNT DISTINCT</code> directly. <br>To learn more about the approximate <code>COUNT DISTINCT</code> hyperfunctions, read the<a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/approx-count-distincts/"> docs</a>.</p><h2 id="enhanced-query-readability-and-maintenance-with-function-pipelines">Enhanced Query Readability and Maintenance With Function Pipelines </h2><p></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">🚀</div><div class="kg-callout-text"><i><b><strong class="italic" style="white-space: pre-wrap;">Note:</strong></b></i><i><em class="italic" style="white-space: pre-wrap;"> In the spirit of </em></i><a href="https://www.timescale.com/blog/blog/move-fast-but-dont-break-things-introducing-the-experimental-schema-with-new-experimental-features-in-timescaledb-2-4/"><i><em class="italic" style="white-space: pre-wrap;">moving fast and not breaking things</em></i></a><i><em class="italic" style="white-space: pre-wrap;">, the hyperfunctions in this section are released as experimental—please play around with them but don’t use them in production.</em></i></div></div><p>At Timescale, we’re huge fans of SQL. But as we’ve seen in many examples above, SQL can get quite unwieldy for certain kinds of analytical and time-series queries. 
Enter TimescaleDB Function Pipelines.</p><p><strong>TimescaleDB Function Pipelines</strong> radically improve the developer ergonomics of analyzing data in PostgreSQL and SQL, by applying principles from <a href="https://en.wikipedia.org/wiki/Functional_programming">functional programming</a> and popular tools like <a href="https://pandas.pydata.org/docs/index.html">Python’s Pandas</a> and <a href="https://prometheus.io/docs/prometheus/latest/querying/basics/">PromQL</a>. In short, they improve your coding productivity, making your SQL code easier for others to comprehend and maintain.</p><p>Inspired by functional programming languages, Function Pipelines enable you to analyze data by composing multiple functions, leading to a simpler, cleaner way of expressing complex logic in PostgreSQL.</p><p>And the best part: we built Function Pipelines in a fully PostgreSQL-compliant way! We did not change any SQL syntax, meaning that any tool that speaks PostgreSQL will be able to support data analysis using function pipelines.</p><p>To understand the power of TimescaleDB Function Pipelines, consider the following PostgreSQL query.</p><p><strong>Regular PostgreSQL query:</strong><br></p><pre><code class="language-SQL">SELECT device_id, 
	sum(abs_delta) as volatility
FROM (
	SELECT device_id, 
		abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts))
        	as abs_delta 
	FROM measurements
	WHERE ts &gt;= now() - '1 day'::interval) calc_delta
GROUP BY device_id;
</code></pre>
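<p>With Function Pipelines, the same volatility calculation can be expressed by chaining operations from left to right. The query below is a sketch of the pipeline form, using the experimental <code>timevector</code> type and the <code>-&gt;</code> operator:</p><pre><code class="language-SQL">SELECT device_id,
	timevector(ts, val) -&gt; sort() -&gt; delta() -&gt; abs() -&gt; sum() as volatility
FROM measurements
WHERE ts &gt;= now() - '1 day'::interval
GROUP BY device_id;
</code></pre>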
<h2 id="supercharge-your-productivity-with-hyperfunctions-today">Supercharge Your Productivity With Hyperfunctions Today</h2><p><strong>Get started today: </strong>TimescaleDB hyperfunctions come pre-loaded and ready to use on every hosted and managed database service in Timescale, the easiest way to get started with TimescaleDB. <a href="https://console.cloud.timescale.com/signup">Get started with a free Timescale trial</a>—no credit card required. Or <a href="https://docs.timescale.com/install/latest/">download for free</a> with TimescaleDB self-managed.</p><p>If you’d like to jump straight into using TimescaleDB hyperfunctions on a real-world dataset, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/#analyze-data-using-timescaledb-continuous-aggregates-and-hyperfunctions">start our tutorial</a>, which uses hyperfunctions to uncover insights about players and teams from the NFL (American football).  </p><p><strong>Learn more: </strong>If you’d like to learn more about TimescaleDB hyperfunctions and how to use them for your use case, read our <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/hyperfunctions/">How-To Guide</a> and the hyperfunctions <a href="https://docs.timescale.com/api/latest/hyperfunctions/">documentation</a>. </p><p>Can’t find the function you need?<strong> </strong>Open an issue on our <a href="https://github.com/timescale/timescaledb-toolkit/issues">GitHub project</a> or contact us on <a href="https://timescaledb.slack.com/">Slack</a> or via the <a href="http://timescale.com/forum/">Timescale Community Forum</a>. We love to work with our users to simplify SQL!<br><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed Up Grafana by Auto-Switching Between Different Aggregations With Postgres]]></title>
            <description><![CDATA[Read a step-by-step guide on enabling Grafana "auto-switching" between PostgreSQL aggregations, depending on the time interval.]]></description>
            <link>https://www.tigerdata.com/blog/speed-up-grafana-autoswitching-postgresql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/speed-up-grafana-autoswitching-postgresql</guid>
            <category><![CDATA[Data Visualization]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Tue, 11 Aug 2020 14:13:02 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2020/08/Autoswtiching-header.gif">
            </media:content>
            <content:encoded><![CDATA[<p><strong>Updated March 4, 2026</strong><em> - Learn how (and why) to speed up your Grafana drill-downs using PostgreSQL to allow "auto-switching" between aggregations, depending on the time interval you select.</em></p><h2 id="the-problem-grafana-is-slow-to-load-visualizations-especially-for-non-aggregated-fine-grained-data">The problem: Grafana is slow to load visualizations, especially for non-aggregated, fine-grained data</h2><p>The&nbsp;<a href="https://grafana.com/" rel="nofollow">Grafana</a>&nbsp;UI is great for drilling down into your data. However, for large amounts of data with second, millisecond, or even nanosecond time granularity, it can be frustratingly slow and result in higher resource usage.</p><p>For example, take this graph of all New York City taxi rides during the month of January 2016:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/grafana_autoswtiching_loading.gif" class="kg-image" alt="Grafana graph loading slowly" loading="lazy" width="1600" height="597" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/grafana_autoswtiching_loading.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/grafana_autoswtiching_loading.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/grafana_autoswtiching_loading.gif 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An example of how slow drill-downs into data can be in Grafana</span></figcaption></figure><p>One common workaround: instead of querying raw data and aggregating on the fly, you query and visualize data from&nbsp;<em>aggregates</em>&nbsp;of your raw data (e.g., one-minute, one-hour, or one-day rollups).</p><p>For PostgreSQL data sources, we do this by aggregating data into views and querying those instead, and for TimescaleDB, we use continuous aggregates—think “automatically 
refreshing materialized views” (for more, see the&nbsp;<a href="https://www.tigerdata.com/docs/use-timescale/latest/continuous-aggregates/" rel="nofollow">continuous aggregates docs</a>).</p><p>However, this often leads to several Grafana panels, each querying the same data aggregated at different granularities. For example, you might capture the same metric over time but set up aggregates at various intervals, such as in minute, hourly, and daily intervals.</p><p>This then requires three separate panels, one for each aggregated interval.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png" class="kg-image" alt="Three different Grafana graphs showing rides from daily and hourly aggregates as well as raw data" loading="lazy" width="1238" height="1144" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-30-at-4.42.57-PM.png 1238w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of three panels all showing taxi rides over January 2016 but in different time granularities (daily, hourly, and per minute, from top to bottom)</span></figcaption></figure><p>But, what if we could use&nbsp;<em>one universal panel</em>&nbsp;that could “automatically” switch between minute, hourly, daily, or any other arbitrary aggregations of our data, depending on the time period we’d like to query and analyze? 
This would speed up queries and use resources like CPU more efficiently.</p><p>Enter the PostgreSQL&nbsp;<code>UNION ALL</code>&nbsp;function.</p><h2 id="the-solution-use-postgres-union-all">The Solution: Use Postgres <code>UNION ALL</code></h2><p>When we use PostgreSQL as our Grafana data source, we can write a single query that automatically switches between different aggregated views of our data (e.g., daily, hourly, weekly views, etc.) in the same Grafana visualization (!).</p><p>🔑&nbsp;<strong>The key</strong>: we (1) use the&nbsp;<code>UNION ALL</code>&nbsp;function to write separate queries to pull data with different aggregations, and (2) then use the&nbsp;<code>WHERE</code>&nbsp;clause to switch the table (or continuous aggregate view) being queried, depending on the length of the time-interval selected (from either the time picker or by highlighting the time period in a graph).</p><p>This allows us to drill arbitrarily deep into our data and makes loading the data as efficient and fast as possible, saving time and CPU resources. 
(In Grafana, drilling into data is typically done by zooming in and out, highlighting the time period of interest in the graph as shown in the image below.)</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif" class="kg-image" alt="Autoswitching between daily and hourly aggregates and raw data depending on time period selected" loading="lazy" width="1280" height="720" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127.gif 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Example of auto-switching between different data aggregations depending on the time interval selected. Learn how to create this example in the tutorial below.</span></figcaption></figure><h2 id="try-it-yourself-implementation-in-grafana-sample-queries">Try It Yourself: Implementation in Grafana &amp; Sample Queries</h2><p>To help you get up and running with&nbsp;<code>UNION ALL</code>, I’ve put together a short step-by-step guide and a few sample queries (which you can modify to suit your project, app, and the metrics you care about).</p><h3 id="scenario">Scenario</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#scenario"></a></p><p>We’ll use the use case of monitoring IoT devices, specifically taxis equipped with sensors. 
For reference, we’ll use a dataset containing all New York City taxi ride activity for January 2016 from the&nbsp;<a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" rel="nofollow">New York Taxi and Limousine Commission</a>&nbsp;(NYC TLC).</p><h3 id="prerequisites">Prerequisites</h3><p><a href="https://gist.github.com/antonum/94a8c6579a5ddb379588c153504f5472#prerequisites"></a></p><ul><li><a href="https://www.tigerdata.com/products/" rel="nofollow">TimescaleDB instance</a>&nbsp;(Tiger Cloud or self-hosted) running PostgreSQL 11+</li><li><a href="https://grafana.com/" rel="nofollow">Grafana instance</a>&nbsp;(cloud or self-hosted)</li><li>TimescaleDB instance connected to Grafana (see&nbsp;<a href="https://www.tigerdata.com/docs/tutorials/latest/grafana/" rel="nofollow">this tutorial</a>&nbsp;for more)</li><li>Use the queries below to create two continuous aggregates with refresh policies. These will be the aggregate views we switch between in our Grafana visualization:</li></ul><p>To create daily aggregates:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_daily
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('1 day', pickup_datetime) AS day, COUNT(*) AS ride_count
    FROM rides
    GROUP BY day
WITH NO DATA;
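-- Note (our addition): with TimescaleDB 2.x, the refresh policy below only
-- covers recent time buckets, so the historical January 2016 data may never be
-- materialized by it. You can backfill the view once manually, e.g.:
-- CALL refresh_continuous_aggregate('rides_daily', '2016-01-01', '2016-02-01');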

SELECT add_continuous_aggregate_policy('rides_daily',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '1 day',
    schedule_interval =&gt; INTERVAL '1 day');</code></pre><p><em>SQL query to create daily aggregates of rides during January 2016</em></p><p>This computes a roll-up of the total number of rides taken during each day of our data’s time period (January 2016).</p><p>To create hourly aggregates:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_hourly
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('1 hour', pickup_datetime) AS hour, COUNT(*) AS ride_count
    FROM rides
    GROUP BY hour
WITH NO DATA;

SELECT add_continuous_aggregate_policy('rides_hourly',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '1 hour',
    schedule_interval =&gt; INTERVAL '1 hour');</code></pre><p><em>SQL query to create hourly aggregates of rides during January 2016</em></p><p>This computes a roll-up of the total number of rides taken during each hour of our data’s time period.</p><p>For more on how continuous aggregates work, see&nbsp;<a href="https://www.tigerdata.com/docs/use-timescale/latest/continuous-aggregates/" rel="nofollow">these docs</a>.</p><h3 id="example-1-auto-switch-between-daily-aggregate-hourly-aggregate-and-raw-data">Example 1: Auto-switch between daily aggregate, hourly aggregate, and raw data</h3><p>In the example below, we have a query using&nbsp;<code>UNION ALL</code>, where we select from only one table or view, depending on the length of the time interval selected in the Grafana UI (controlled by the&nbsp;<code>$__timeFrom</code>&nbsp;and&nbsp;<code>$__timeTo</code>&nbsp;macros in Grafana).</p><p>As the comments in the code below show, we use daily aggregates for intervals greater than 14 days, hourly aggregates for intervals between 3 and 14 days, and per-minute aggregates calculated on the fly from raw data for intervals less than 3 days:</p><p><strong>Switching between daily aggregation, hourly aggregation, and minute aggregations on raw data</strong></p><pre><code class="language-SQL">-- Use daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND $__timeFilter(hour)
UNION ALL
-- Use raw data (minute intervals) for intervals between 0 and 3 days
SELECT * FROM
    (SELECT time_bucket('1m',pickup_datetime) AS time, count(*), 'minute' AS metric
    FROM rides
    WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND $__timeFilter(pickup_datetime)
    GROUP BY 1) minute
ORDER BY 1;</code></pre><p><em>Query to switch between daily aggregation, hourly aggregation, and per-minute aggregations created on the fly using raw data</em></p><p><strong>This produces the following behavior in our Grafana panels:</strong></p><p>Querying daily aggregates for intervals greater than 14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png" class="kg-image" alt="Graph showing rides taking place in daily intervals for an interval greater than 14 days" loading="lazy" width="1600" height="752" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.46.44-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by the daily aggregate view for intervals greater than 14 days</span></figcaption></figure><p>Querying hourly aggregates for intervals between 3-14 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png" class="kg-image" alt="Graph showing rides taking place in hourly intervals for an interval between 3 and 14 days" loading="lazy" width="1600" height="749" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.14-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by the 
hourly aggregate view for intervals between 3 and 14 days</span></figcaption></figure><p>Querying raw data for intervals less than 3 days:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png" class="kg-image" alt="Graph showing rides taking place in minute intervals for an interval less than 3 days" loading="lazy" width="1600" height="751" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Screen-Shot-2020-07-27-at-2.47.36-PM.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">The graph is powered by rolling up raw data into 1-minute intervals on the fly for intervals of less than 3 days</span></figcaption></figure><p>This allows you to automatically switch between different aggregations of data, depending on the length of the time interval selected. 
Notice how the granularity of the data gets richer as we drill down from looking at data over the month of January to looking at data in a single day:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif" class="kg-image" alt="Graph changing from daily interval to hourly interval to minute interval as we zoom in" loading="lazy" width="1280" height="720" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 1000w, https://timescale.ghost.io/blog/content/images/2022/01/final_5f1f1f7b0c9e000015634870_766127-1.gif 1280w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Demo of automatically switching between daily, hourly, and minute aggregations of data, depending on time interval selected</span></figcaption></figure><h3 id="example-2-auto-switch-between-daily-hourly-and-10-minute-aggregates">Example 2: Auto-switch between daily, hourly, and 10-minute aggregates</h3><p>Querying only from continuous aggregates allows us to speed up our dashboards even further. You might not want to directly query the hypertable that houses your raw data, as the queries may be slower due to things like new data being inserted into the hypertable.</p><p>The following example shows a query for switching between aggregations of different granularity without using the raw data hypertable at all (unlike Example 1, which does on-the-fly rollups of raw data).</p><p>First, let’s create 10-minute rollups of the raw data:</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW rides_10mins
WITH (timescaledb.continuous)
AS
    SELECT time_bucket('10 minutes', pickup_datetime) AS bucket, COUNT(*) AS ride_count
    FROM rides
    GROUP BY bucket
WITH NO DATA;

SELECT add_continuous_aggregate_policy('rides_10mins',
    start_offset =&gt; INTERVAL '1 month',
    end_offset =&gt; INTERVAL '10 minutes',
    schedule_interval =&gt; INTERVAL '10 minutes');</code></pre><p><em>Query to create 10-minute rollups of data in a continuous aggregate</em></p><p><strong>Switching between daily aggregation, hourly aggregation, and minute aggregations (no raw data involved)</strong></p><pre><code class="language-SQL">-- Use Daily aggregate for intervals greater than 14 days
SELECT day as time, ride_count, 'daily' AS metric
FROM rides_daily
WHERE  $__timeTo()::timestamp - $__timeFrom()::timestamp &gt; '14 days'::interval AND  $__timeFilter(day)
UNION ALL
-- Use hourly aggregate for intervals between 3 and 14 days
SELECT hour, ride_count, 'hourly' AS metric
FROM rides_hourly
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp BETWEEN '3 days'::interval AND '14 days'::interval AND  $__timeFilter(hour)
UNION ALL
-- Use 10-minute aggregate for intervals between 0 and 3 days
SELECT bucket, ride_count, '10min' AS metric
FROM rides_10mins
WHERE $__timeTo()::timestamp - $__timeFrom()::timestamp &lt; '3 days'::interval AND  $__timeFilter(bucket)
ORDER BY 1; </code></pre><p><em>Query to switch between daily aggregation, hourly aggregation, and 10-minute aggregations, all using continuous aggregates</em></p><p>In this post, we saw how to use&nbsp;<code>UNION ALL</code>&nbsp;to automatically switch which aggregate view we’re querying based on the time interval selected, so that we can do more efficient drill-downs and make Grafana faster.</p><p>You can find more information about the&nbsp;<code>UNION ALL</code>&nbsp;operator and how it works in this&nbsp;<a href="https://www.postgresqltutorial.com/postgresql-union/" rel="nofollow">PostgreSQL tutorial</a> (from the aptly named PostgreSQLtutorial.com) and the&nbsp;<a href="https://www.postgresql.org/docs/current/queries-union.html" rel="nofollow">official PostgreSQL documentation</a>.</p><p>That’s it! You can modify this code to change the aggregates you query, the time intervals, and the metrics you want to visualize to suit your needs and projects.</p><p>Happy auto-switching!</p><h2 id="next-steps">Next Steps</h2><p>In this tutorial, we learned how to use PostgreSQL&nbsp;<code>UNION ALL</code>&nbsp;to solve a common Grafana issue: slow-loading dashboards when we want to query fine-grained raw data (like millisecond performance metrics).</p><p>The result: you create graphs that enable you to switch between different aggregations of your data automatically. This allows you to quickly drill down into your metrics, saving time&nbsp;<em>and</em>&nbsp;CPU resources!</p><p>For more resources&nbsp;<a href="https://www.tigerdata.com/blog/slow-grafana-performance-learn-how-to-fix-it-using-downsampling/" rel="nofollow">to speed up Grafana performance, learn how you can fix slow dashboards using downsampling</a>.</p><h3 id="learn-more">Learn more</h3><p>Want more Grafana tips? 
Explore our&nbsp;<a href="https://www.tigerdata.com/docs/tutorials/latest/grafana/" rel="nofollow">Grafana tutorials</a>.</p><p>Need a database to power your dashboarding and data analysis?&nbsp;<strong>Get started with a free</strong>&nbsp;<a href="https://console.cloud.timescale.com/signup" rel="nofollow"><strong>Tiger Cloud account</strong></a><strong>.</strong></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Analyze Cryptocurrency Market Data using TimescaleDB, PostgreSQL and Tableau: a Step-by-Step Tutorial]]></title>
            <description><![CDATA[This tutorial is a step-by-step guide on how to analyze a time-series cryptocurrency dataset using Postgres, TimescaleDB and Tableau]]></description>
            <link>https://www.tigerdata.com/blog/tutorials-how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/tutorials-how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial</guid>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[Tableau]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Thu, 19 Sep 2019 19:55:13 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2019/09/Hero-1-1.jpg">
            </media:content>
            <content:encoded><![CDATA[<p>This tutorial is a step-by-step guide on how to analyze a time-series cryptocurrency dataset using Postgres, TimescaleDB and Tableau. The instructions in this tutorial were used to create this <a href="https://timescale.ghost.io/blog/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/">analysis of 4100+ cryptocurrencies</a>.</p><h2 id="overview-of-steps">Overview of steps</h2><p><strong>Step 0: Install TimescaleDB via Timescale Cloud: </strong>We’ll create a Timescale Cloud account and spin up a TimescaleDB instance.</p><p><strong>Step 1: Design the database schema: </strong>We’ll guide you through how to design a schema for cryptocurrency data to use with TimescaleDB.</p><p><strong>Step 2: Create a dataset to analyze: </strong>We’ll use the CryptoCompareAPI and Python to create a CSV file containing the data to analyze.</p><p><strong>Step 3: Load dataset into TimescaleDB: </strong>We’ll insert the data from the CSV file into TimescaleDB using pgAdmin.</p><p><strong>Step 4: Query the data in TimescaleDB: </strong>We’ll connect our data in TimescaleDB to Tableau and perform queries on the dataset.</p><p><strong>Step 5: Visualize the results: </strong>We’ll use Tableau in order to visualize the results from our queries.</p><p>You can download all files and code used in this analysis in this <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial">Github repo</a>. Note that the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset provided</a> tracks OHLCV price data on 4198 different cryptocurrencies (courtesy of <a href="https://www.cryptocompare.com/">CryptoCompare</a>) as of 9/16/2019. 
If you follow the steps, your dataset will be current up to the date on which you perform the analysis.</p><h2 id="step-0-install-timescaledb-via-timescale-cloud">Step 0: Install TimescaleDB via Timescale Cloud</h2><p>Go to <a href="http://www.timescale.com/cloud">www.timescale.com/cloud</a> and sign up for a free trial, where you will receive $300 in credits, to use a cloud-hosted and managed version of TimescaleDB. This is the easiest way to install the DB. If you prefer, you can <a href="https://docs.timescale.com/latest/getting-started/installation">install an instance yourself on your machine by following these instructions</a>. However, the instructions in this post will assume you’re using Timescale Cloud.</p><p>After you’ve created an account, log in and create a database instance (you can name it something like “crypto_database”). Then select your preferred configuration (dev-only should be enough for this analysis). After successfully creating the database instance, you should see it active.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/1.-New-Db-instance-active.png" class="kg-image" alt="" loading="lazy"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Fig 1: Timescale Cloud page showing an active TSDB instance</em></i></figcaption></figure><p>Once the instance is active, navigate to “Databases” and create a new database. I’ve called mine ‘crypto-test’. 
You should see it in the list of Databases after it’s been created.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/3.Create-New-Database.png" class="kg-image" alt="" loading="lazy"><figcaption><i><em class="italic" style="white-space: pre-wrap;">Fig 2: Successful creation of a database in a Timescale Cloud instance</em></i></figcaption></figure><h2 id="step-1-design-the-database-schema">Step 1: Design the database schema</h2><p>Now that our database is up and running, we need some data to insert into it. Before we get data for analysis, we first need to define what kind of data we want to perform queries on. (To skip ahead, see the code in <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql">schema.sql</a>.)</p><p>In our analysis, we have two main goals.</p><ol><li>We want to explore the price of Bitcoin and Ethereum, expressed in different fiat currencies, over time.</li><li>We want to explore the price of different cryptocurrencies, expressed in Bitcoin, over time.</li></ol><p>Examples of questions we might want to ask are:</p><ul><li>How has Bitcoin’s price in USD varied over time?</li><li>How has Ethereum’s price in ZAR varied over time?</li><li>How has Bitcoin’s trading volume in KRW increased or decreased over time?</li><li>Which crypto has the highest trading volume in the last two weeks?</li><li>On which day was Bitcoin most profitable?</li><li>Which are the most profitable new coins from the past 3 months?</li></ul><p>Understanding the questions we want to ask of the data leads us to define a schema for our database, so that we can acquire the necessary data to populate it.<br>Our requirements lead us to four tables: three TimescaleDB <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">hypertables</a>, <em>btc_prices, crypto_prices </em>and<em> eth_prices</em>, and one relational table, <em>currency_info</em>.</p><p>The 
btc_prices hypertable contains data about Bitcoin prices in 17 different fiat currencies since 2010:</p><table>
<thead>
<tr>
<th style="text-align:center">btc_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_btc</td>
<td>The volume exchanged in the cryptocurrency value that day, in BTC.</td>
</tr>
<tr>
<td style="text-align:center">volume_currency</td>
<td>The volume exchanged in its converted value for that day, quoted in the corresponding fiat currency.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the fiat currency used for non-btc prices/volumes.</td>
</tr>
</tbody>
</table>
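<p>As a quick illustration of the kind of question this schema answers (a sample query of our own, not from the tutorial’s files), here is how you might chart Bitcoin’s closing price in USD over time:</p><pre><code class="language-SQL">-- Daily closing price of Bitcoin in USD
SELECT time, closing_price
FROM btc_prices
WHERE currency_code = 'USD'
ORDER BY time;</code></pre>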
<p><br>Similar to btc_prices, the eth_prices hypertable contains data about Ethereum prices in 17 different fiat currencies since 2015:</p><table>
<thead>
<tr>
<th style="text-align:center">eth_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_eth</td>
<td>The volume exchanged in the cryptocurrency value that day, in ETH.</td>
</tr>
<tr>
<td style="text-align:center">volume_currency</td>
<td>The volume exchanged in its converted value for that day, quoted in the corresponding fiat currency.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the fiat currency used for non-ETH prices/volumes.</td>
</tr>
</tbody>
</table>
<p>The crypto_prices hypertable contains data about 4198 cryptocurrencies, including bitcoin and the corresponding crypto/BTC exchange rate, since 2012 or so.</p><table>
<thead>
<tr>
<th style="text-align:center">crypto_prices hypertable schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">time</td>
<td>The day-specific timestamp of the price records, with time given as the default 00:00:00+00</td>
</tr>
<tr>
<td style="text-align:center">opening_price</td>
<td>The first price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">highest_price</td>
<td>The highest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">lowest_price</td>
<td>The lowest price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">closing_price</td>
<td>The last price at which the coin was exchanged that day</td>
</tr>
<tr>
<td style="text-align:center">volume_crypto</td>
<td>The volume exchanged in the cryptocurrency value that day, in that crypto.</td>
</tr>
<tr>
<td style="text-align:center">volume_btc</td>
<td>The volume exchanged in its converted value for that day, quoted in BTC.</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>Corresponds to the cryptocurrency’s symbol (e.g., ETH, LTC).</td>
</tr>
</tbody>
</table>
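<p>This is the schema behind questions like “Which crypto has the highest trading volume in the last two weeks?” As a sketch (our own example query, not from the tutorial; column names follow the <em>crypto_prices</em> table in schema.sql below):</p><pre><code class="language-SQL">-- Top 10 cryptos by trading volume over the last 14 days, quoted in BTC
SELECT currency_code, SUM(volume_btc) AS total_volume_btc
FROM crypto_prices
WHERE time &gt; now() - INTERVAL '14 days'
GROUP BY currency_code
ORDER BY total_volume_btc DESC
LIMIT 10;</code></pre>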
<p>Lastly, we have the currency_info table (a regular relational table, not a hypertable), which maps each currency’s code to its name.</p><table>
<thead>
<tr>
<th style="text-align:center">currency_info table schema</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">Field</td>
<td>Description</td>
</tr>
<tr>
<td style="text-align:center">currency_code</td>
<td>2-7 character abbreviation for currency. Used in other hypertables</td>
</tr>
<tr>
<td style="text-align:center">currency</td>
<td>English name of currency</td>
</tr>
</tbody>
</table>
<p>Once we’ve established the schema for the tables in our database, we can formulate create_table SQL statements to actually create the tables we need:</p><p>Code from <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql"><em>schema.sql</em></a>:</p><pre><code class="language-SQL">--Schema for cryptocurrency analysis
DROP TABLE IF EXISTS "currency_info";
CREATE TABLE "currency_info"(
   currency_code   VARCHAR (10),
   currency        TEXT
);
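-- (Optional, our suggestion rather than the tutorial's) a primary key on
-- currency_code prevents duplicate rows and speeds up joins against it:
-- ALTER TABLE "currency_info" ADD PRIMARY KEY (currency_code);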

--Schema for btc_prices table
DROP TABLE IF EXISTS "btc_prices";
CREATE TABLE "btc_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_btc      DOUBLE PRECISION,
   volume_currency DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Schema for crypto_prices table
DROP TABLE IF EXISTS "crypto_prices";
CREATE TABLE "crypto_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_crypto   DOUBLE PRECISION,
   volume_btc      DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Schema for eth_prices table
DROP TABLE IF EXISTS "eth_prices";
CREATE TABLE "eth_prices"(
   time            TIMESTAMP WITH TIME ZONE NOT NULL,
   opening_price   DOUBLE PRECISION,
   highest_price   DOUBLE PRECISION,
   lowest_price    DOUBLE PRECISION,
   closing_price   DOUBLE PRECISION,
   volume_eth      DOUBLE PRECISION,
   volume_currency DOUBLE PRECISION,
   currency_code   VARCHAR (10)
);

--Timescale specific statements to create hypertables for better performance
SELECT create_hypertable('btc_prices', 'time', 'currency_code', 2);
SELECT create_hypertable('eth_prices', 'time', 'currency_code', 2);
SELECT create_hypertable('crypto_prices', 'time', 'currency_code', 2);</code></pre><p>Notice that we include three create_hypertable statements, which are specific to TimescaleDB. For more on hypertables, see the <a href="https://docs.timescale.com/latest/using-timescaledb/hypertables">Timescale docs</a> and this <a href="https://timescale.ghost.io/blog/blog/when-boring-is-awesome-building-a-scalable-time-series-database-on-postgresql-2900ea453ee2/">blog post</a>.</p><h2 id="step-2-create-a-dataset-to-analyze">Step 2: Create a dataset to analyze</h2><p>Now that we’ve defined the data we want, it’s time to construct a dataset containing that data. To do this, we’ll write a small Python script (to skip ahead, see <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_data_extraction.py">crypto_data_extraction.py</a>) for extracting data from <a href="https://www.cryptocompare.com/">cryptocompare.com</a> into four CSV files (coin_names.csv, crypto_prices.csv, btc_prices.csv and eth_prices.csv).<br></p><p>In order to get data from CryptoCompare, you’ll need to obtain an <a href="https://min-api.cryptocompare.com/pricing">API key</a>. For this analysis, the free key should be plenty.</p><p>The script consists of 5 parts:<br><strong>(1) Setup: </strong>First, we need to import some libraries to help us parse the data. Notably, we will use the Python ‘requests’ library, which makes it easy to deal with JSON data from a web API endpoint.</p><pre><code class="language-python">import requests
import json
import csv
from datetime import datetime</code></pre><p>Moreover you’ll need your <a href="https://min-api.cryptocompare.com/pricing">CryptoCompare API key</a> as a variable. We’ve just included it as a normal variable in the code below (this is not recommended for production code) but you can store it as an environment variable or follow whatever production security practices for API key management you usually do.<br></p><pre><code class="language-python">apikey = 'YOUR_CRYPTO_COMPARE_API_KEY'
#attach to end of URLstring
url_api_part = '&amp;api_key=' + apikey</code></pre><p><br><strong>(2) Get a list of all coin names to populate table <em>currency_info: </em></strong>First we use the Python requests library’s <em>get function</em> to get a JSON object containing the list of coins names and symbols on CryptoCompare. Then we convert the data to a dictionary form and write information about the coin to the csv file ‘coin_names.csv’.</p><pre><code class="language-python">#####################################################################
#2. Populate list of all coin names
#####################################################################
#URL to get a list of coins from cryptocompare API
URLcoinslist = 'https://min-api.cryptocompare.com/data/all/coinlist'

#Get list of cryptos with their symbols
res1 = requests.get(URLcoinslist)
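#(Our addition, not in the original script) fail fast on HTTP-level errors
res1.raise_for_status()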
res1_json = res1.json()
data1 = res1_json['Data']
symbol_array = []
cryptoDict = dict(data1)

#write to CSV
with open('coin_names.csv', mode = 'w') as test_file:
   test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
   for coin in cryptoDict.values():
       name = coin['Name']
       symbol = coin['Symbol']
       symbol_array.append(symbol)
       coin_name = coin['CoinName']
       full_name = coin['FullName']
       entry = [symbol, coin_name]
       test_file_writer.writerow(entry)
print('Done getting crypto names and symbols. See coin_names.csv for result')</code></pre><p><strong>(3) Get historical BTC prices for 4198 other cryptos to populate <em>crypto_prices: </em></strong>Once we have the list of all the coin names, we can iterate through them and pull their historical prices in BTC since they listed on CryptoCompare. We write that information to the CSV file “crypto_prices.csv”.</p><pre><code class="language-python">#####################################################################
#3. Populate historical price for each crypto in BTC
#####################################################################
#Note: this part might take a while to run since we're populating data for 4k+ coins
#counter variable for progress made
progress = 0
num_cryptos = str(len(symbol_array))
for symbol in symbol_array:
   # get data for that currency
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym='+ symbol +'&amp;tsym=BTC&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('crypto_prices.csv', mode = 'a') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, symbol]
           test_file_writer.writerow(entry)
   progress = progress + 1
   print('Processed ' + str(symbol))
   print(str(progress) + ' currencies out of ' +  num_cryptos + ' written to csv')
print('Done getting price data for all coins. See crypto_prices.csv for result')</code></pre><p>Notice how the fields defined in Step 1 influence the choice of what data we write to the CSV file!</p><p><strong>(4) Get historical Bitcoin prices in different fiat currencies to populate <em>btc_prices: </em></strong>We then create a list of different fiat currencies in which we want to express Bitcoin’s price. Unfortunately CryptoCompare doesn’t have a comprehensive list so we’ve hard coded (gasp!) a list of 17 popular fiat currencies.</p><p>We then iterate over the list of fiat currencies and pull the historical Bitcoin price in that currency and write it to the CSV file “btc_prices.csv”.</p><pre><code class="language-python">#####################################################################
#4. Populate BTC prices in different fiat currencies
#####################################################################
# List of fiat currencies we want to query
# You can expand this list, but CryptoCompare does not have
# a comprehensive fiat list on their site
fiatList = ['AUD', 'CAD', 'CNY', 'EUR', 'GBP', 'GOLD', 'HKD',
'ILS', 'INR', 'JPY', 'KRW', 'PLN', 'RUB', 'SGD', 'UAH', 'USD', 'ZAR']
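
# Note (our addition): for unsupported pairs the API returns an error payload
# ("Response": "Error") instead of a "Data" key; a defensive check inside the
# loop below could skip those, e.g.:
#   if res_json.get('Response') == 'Error':
#       continue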

#counter variable for progress made
progress2 = 0
for fiat in fiatList:
   # get data for bitcoin price in that fiat
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym='+fiat+'&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('btc_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, fiat]
           test_file_writer.writerow(entry)
   progress2 = progress2 + 1
   print('Processed ' + str(fiat))
   print(str(progress2) + ' currencies out of ' + str(len(fiatList)) + ' written')
print('Done getting price data for btc. See btc_prices.csv for result')</code></pre><p><br><strong>(5) Get historical Ethereum prices in different fiat currencies to populate <em>eth_prices: </em></strong>Lastly, we do the same for Ethereum and the list of fiat currencies and write the results in the CSV file “eth_prices.csv”.</p><pre><code class="language-python">#####################################################################
#5. Populate ETH prices in different fiat currencies
#####################################################################
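# Aside (ours, not in the original tutorial): sections 3, 4, and 5 all
# repeat the same "format one day, append it to a CSV" inner loop. That
# shared logic could be factored into helpers like the following
# (hypothetical names; note that datetime.utcfromtimestamp, used above,
# is deprecated in newer Python, so this uses a timezone-aware equivalent):
import csv
from datetime import datetime, timezone

def day_to_row(day, label):
    """Map one CryptoCompare 'day' dict to the CSV row shape used above."""
    ts = datetime.fromtimestamp(day['time'], tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
    return [ts, day['open'], day['high'], day['low'], day['close'],
            day['volumefrom'], day['volumeto'], label]

def append_history(filename, data, label):
    """Append all daily rows for one symbol or fiat currency to a CSV file."""
    with open(filename, mode='a', newline='') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for day in data:
            writer.writerow(day_to_row(day, label))

# With these helpers, the loop below reduces to fetching the data and
# calling append_history('eth_prices.csv', data, fiat).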
#counter variable for progress made
progress3 = 0
for fiat in fiatList:
   # get data for ethereum price in that fiat
   URL = 'https://min-api.cryptocompare.com/data/histoday?fsym=ETH&amp;tsym='+fiat+'&amp;allData=true' + url_api_part
   res = requests.get(URL)
   res_json = res.json()
   data = res_json['Data']
   # write required fields into csv
   with open('eth_prices.csv', mode = 'a', newline = '') as test_file:
       test_file_writer = csv.writer(test_file, delimiter = ',', quotechar = '"', quoting=csv.QUOTE_MINIMAL)
       for day in data:
           rawts = day['time']
           ts = datetime.utcfromtimestamp(rawts).strftime('%Y-%m-%d %H:%M:%S')
           o = day['open']
           h = day['high']
           l = day['low']
           c = day['close']
           vfrom = day['volumefrom']
           vto = day['volumeto']
           entry = [ts, o, h, l, c, vfrom, vto, fiat]
           test_file_writer.writerow(entry)
   progress3 = progress3 + 1
   print('Processed ' + str(fiat))
   print(str(progress3) + ' currencies out of ' + str(len(fiatList)) + ' written')
print('Done getting price data for eth. See eth_prices.csv for result')</code></pre><p>If you’d rather not pull a fresh data set, you’re welcome to use the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset we already created</a>, but note that it only has data until 9/16/2019.</p><h2 id="step-3-load-dataset-into-timescaledb-using-timescale-cloud-and-pgadmin">Step 3: Load dataset into TimescaleDB, using Timescale Cloud and pgAdmin</h2><p>After following Step 2, you should have 4 CSV files (if not download them <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">here</a>).</p><p>The next step is to load this data into TimescaleDB in order to query it and perform our analysis. In Step 0, we created a TimescaleDB instance in Timescale Cloud, so all that’s left is to use pgAdmin to create the tables from Step 1 and transfer data from each csv file to the relevant table.</p><p><strong>3.1 Connect to your TimescaleDB instance</strong></p><p>Download and install <a href="https://www.pgadmin.org/">pgAdmin</a>, or your favorite postgres admin tool (utilities like <a href="http://postgresguide.com/utilities/psql.html">psql</a> also work). 
Once installed, login to your database using the credentials on the ‘Overview page’ of your Timescale Cloud instance as shown in Fig 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/2.-TSDB-Cloud-overview-creds-page-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 3: Timescale Cloud ‘Overview’ page to find credentials to login</span></figcaption></figure><p>Once logged in, you should see something like Fig 4 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/4.-Connected-to-TSDB.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 4: Successful login to Timescale Cloud database in pgAdmin!</span></figcaption></figure><p></p><p><strong>3.2 Use the SQL code from Step 1 to create tables</strong></p><p>Now all our hard work in Step 1 comes in handy! Use the Query Tool in pgAdmin to create the tables we defined in Step 1. One way to find the tool is to navigate to your_project_name -&gt; Databases-&gt; your_db_name, then right click and select Query Tool, as Fig 5 shows below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/QUERY-TOOL.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 5: Locating the Query Tool in pgAdmin</span></figcaption></figure><p>Next, we copy and paste the code from Step 1 (found in <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/schema.sql">schema.sql</a>) into the Query Editor and run the query. 
This creates the 4 tables with the necessary fields, as well as the 3 hypertables for btc_prices, eth_prices and crypto_prices.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Example-query.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 6: Run the code from schema.sql in the Query Tool in pgAdmin to create the necessary tables</span></figcaption></figure><p>To check that our query was successful, look at the output in the Data Output pane or navigate down to your_project_name -&gt; databases -&gt; your_db_name -&gt; schemas -&gt; tables and you should see your 4 tables.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Table-created-.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 7: Successful creation of tables for analysis in pgAdmin</span></figcaption></figure><p><strong>3.3 Import the data into the database tables</strong></p><p>Now that we’ve created the tables with our desired schema, all that’s left is to insert the data from the CSV files we’ve created into the tables.</p><p>We will do this using the Import tool in pgAdmin, but you could use psql or, for better performance on large datasets, the <a href="https://github.com/timescale/timescaledb-parallel-copy">TimescaleDB parallel-copy tool</a>. However, for files of the size we’ve created, the pgAdmin import tool works just fine.</p><p>To import a CSV file to a table, navigate down to your_project_name -&gt; databases -&gt; your_db_name -&gt; schemas -&gt; tables, right-click on the table you’d like to insert data into, and then select ‘Import/Export’ from the menu, as shown in Fig 8 below. We’ll first insert data into the btc_prices table. 
</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Import-1.png" class="kg-image" alt="" loading="lazy"></figure><p>Once you’ve selected Import/Export, select Import, then select the csv file from your directory (in this case, btc_prices.csv), and then select comma (,) as the delimiter, as shown in Fig 9 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Import-2.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 9: Importing btc_prices.csv into the btc_prices table using pgAdmin</span></figcaption></figure><p>To check if this worked, right click on btc_prices table, select ‘view/edit data’ -&gt; ‘all rows’, as shown in Fig 10. Fig 11 shows that our data has successfully been inserted into the btc_prices table.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Check-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 10: Checking that data has been inserted into the table in pgAdmin</span></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Check-2.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 11: Verifying that btc price data is in fact in btc_prices table using pgAdmin</span></figcaption></figure><p>Repeat these steps above for crypto_prices.csv and the crypto_prices table, eth_prices.csv and the eth_prices table and coin_names.csv and the currency_info table, respectively.</p><h2 id="step-4-query-the-data"><br>Step 4: Query the Data</h2><p>With the data needed for our analysis now sitting snugly in our database tables, we can now perform queries on our dataset in order to answer some of the questions posed in Step 
1.</p><p>The code below (from <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_queries.sql">crypto_queries.sql</a>) contains a sample list of questions and corresponding queries to answer those questions. Of course, you can add in your own questions and create Postgres queries to answer them in addition to, or in place of, the questions and queries provided.</p><p>From <a href="https://github.com/timescale/examples/blob/master/crypto_tutorial/crypto_queries.sql">crypto_queries.sql</a>:</p><pre><code class="language-SQL">--Query 1
-- How did Bitcoin price in USD vary over time?
-- BTC 7 day prices
SELECT time_bucket('7 days', time) as period,
      last(closing_price, time) AS last_closing_price
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period
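-- (Aside, not in the original file: time_bucket() groups the rows into
-- fixed 7-day windows, and last(closing_price, time) returns the
-- closing_price of the latest row within each window, i.e., the weekly
-- closing price.)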

--Query 2
-- How did BTC daily returns vary over time?
-- Which days had the worst and best returns?
-- BTC daily return
SELECT time,
      closing_price / lead(closing_price) over prices AS daily_factor
FROM (
  SELECT time,
         closing_price
  FROM btc_prices
  WHERE currency_code = 'USD'
  GROUP BY 1,2
) sub window prices AS (ORDER BY time DESC)
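-- (Aside, not in the original file: because the window above orders by
-- time DESC, lead() fetches the previous day's close, so daily_factor is
-- today's close divided by yesterday's. An equivalent form uses lag()
-- over ascending time:
--   SELECT time,
--          closing_price / lag(closing_price) OVER (ORDER BY time) AS daily_factor
--   FROM btc_prices
--   WHERE currency_code = 'USD';
-- )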

--Query 3
-- How did the trading volume of Bitcoin vary over time in different fiat currencies?
-- BTC volume in different fiat in 7 day intervals
SELECT time_bucket('7 days', time) as period,
      currency_code,
      sum(volume_btc)
FROM btc_prices
GROUP BY currency_code, period
ORDER BY period

-- Q4
-- How did Ethereum (ETH) price in BTC vary over time?
-- ETH prices in BTC in 7 day intervals
SELECT
   time_bucket('7 days', time) AS time_period,
   last(closing_price, time) AS closing_price_btc
FROM crypto_prices
WHERE currency_code='ETH'
GROUP BY time_period
ORDER BY time_period

--Q5
-- How did ETH prices, in different fiat currencies, vary over time?
-- (using the BTC/Fiat exchange rate at the time)
-- ETH prices in fiat
SELECT time_bucket('7 days', c.time) AS time_period,
      last(c.closing_price, c.time) AS last_closing_price_in_btc,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'USD') AS last_closing_price_in_usd,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'EUR') AS last_closing_price_in_eur,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'CNY') AS last_closing_price_in_cny,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'JPY') AS last_closing_price_in_jpy,
      last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'KRW') AS last_closing_price_in_krw
FROM crypto_prices c
JOIN btc_prices b
   ON time_bucket('1 day', c.time) = time_bucket('1 day', b.time)
WHERE c.currency_code = 'ETH'
GROUP BY time_period
ORDER BY time_period
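-- (Aside, not in the original file: the FILTER clauses above restrict each
-- last() aggregate to the rows of one fiat currency, so a single pass over
-- the joined data yields ETH closing prices in five fiat currencies at once.)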

--Q6
--Crypto by date of first data
SELECT ci.currency_code, min(c.time)
FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
AND c.closing_price &gt; 0
GROUP BY ci.currency_code
ORDER BY min(c.time) DESC

--Q7
-- Number of new cryptocurrencies by day
-- Which days had the most new cryptocurrencies added?
SELECT day, COUNT(code)
FROM (
  SELECT min(c.time) AS day, ci.currency_code AS code
  FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
  AND c.closing_price &gt; 0
  GROUP BY ci.currency_code
  ORDER BY min(c.time)
)a
GROUP BY day
ORDER BY day DESC


--Q8
-- Which cryptocurrencies had the most transaction volume in the past 14 days?
--Crypto transaction volume during a certain time period
SELECT 'BTC' as currency_code,
       sum(b.volume_currency) as total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
GROUP BY b.currency_code
UNION
SELECT c.currency_code as currency_code,
       sum(c.volume_btc) * avg(b.closing_price) as total_volume_in_usd
FROM crypto_prices c JOIN btc_prices b ON date(c.time) = date(b.time)
WHERE c.volume_btc &gt; 0
AND b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
AND now() - date(c.time) &lt; INTERVAL '14 day'
GROUP BY c.currency_code
ORDER BY total_volume_in_usd DESC

--Q9
--Which cryptocurrencies had the top daily return?
--Top crypto by daily return
WITH
   prev_day_closing AS (
SELECT
   currency_code,
   time,
   closing_price,
   LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY TIME DESC) AS prev_day_closing_price
FROM
    crypto_prices  
)
,    daily_factor AS (
SELECT
   currency_code,
   time,
   CASE WHEN prev_day_closing_price = 0 THEN 0 ELSE closing_price/prev_day_closing_price END AS daily_factor
FROM
   prev_day_closing
)
SELECT
   time,
   LAST(currency_code, daily_factor) as currency_code,
   MAX(daily_factor) as max_daily_factor
FROM
   daily_factor
GROUP BY
   TIME</code></pre><p>For this step and in Step 5, we’ll use <a href="https://www.tableau.com/">Tableau</a> to run the above queries on the dataset and visualize the output. You’re welcome to use other data visualization tools like <a href="https://grafana.com/blog/2018/10/15/make-time-series-exploration-easier-with-the-postgresql/timescaledb-query-editor/">Grafana</a>, but ensure that the tool you’ve selected has a Postgres connector.</p><p>The following steps show how to query the data using <a href="https://www.tableau.com/">Tableau</a>:<br><strong>4.1 Create a new workbook: </strong>This will be used to house all the graphs for the analysis.</p><p><strong>4.2 Connect TimescaleDB to Tableau: </strong>Create a connection between Tableau and TimescaleDB running in your Timescale Cloud instance.</p><p>Connecting your TimescaleDB instance in the cloud to Tableau takes just a few clicks, thanks to Tableau’s built-in Postgres connector. To connect to your database, add a new connection and, under the ‘to a server’ section, select PostgreSQL as the connection type. Then enter your database credentials (found in the Timescale Cloud ‘Overview’ tab) like we did in Step 3.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/New-Data-Source-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 12: Adding a connection from Timescale Cloud in Tableau</span></figcaption></figure><p><strong>4.3 Create a new data source: </strong>Create a new data source and rename it to something unique, as by default it takes the name of your database.</p><p>You’ll need to do this for each query in the analysis, since each data source only supports one piece of custom SQL. 
A way to create many data sources with the same database is to create one in the way described above, duplicate it and then change the custom SQL used each time, since the database you’re connecting to remains the same.</p><p><strong>4.4 Query the data: </strong>Here we’ll use Tableau and the built in SQL editor. To run a query, add custom SQL to your data source by dragging and dropping the “New Custom SQL” button to the place that says ‘Add tables here’.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/New-Custom-SQL-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 13: Adding Custom SQL to a data source in Tableau</span></figcaption></figure><p>Once you’ve done that, paste the query you want in the query editor. In the example in Fig 14 below, we’ll use Query 1 from crypto_queries.sql, for historical BTC prices in USD.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Tableau-Query-Tool.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 14: Query for Historical Bitcoin prices in USD in the Tableau query editor</span></figcaption></figure><p><br>Once you’ve entered the query, press OK and then Update Now and you’ll see the results in a table, as illustrated by Fig 15 below.<br></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Query-Success-T.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 15: Results from a successful execution of Query 1 in Tableau</span></figcaption></figure><h2 id="step-5-data-visualization-in-tableau">Step 5: Data Visualization in Tableau</h2><p>Results in a table are only so useful, graphs are much better! 
So in our final step, let’s take our output from Step 4 and turn it into an interactive graph in Tableau.</p><p>To do this, create a new worksheet (or dashboard) and then select your desired data source (in our case ‘btc 7 days’), shown in Fig 16 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/New-worksheet-with-data-source-1.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 16: A new Tableau worksheet linked to the data source ‘btc 7 days’</span></figcaption></figure><p>Next, you locate the Dimensions and Measures pane on the left. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/dimensions-and-measures.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 17: The dimensions and measures pane in Tableau</span></figcaption></figure><p>Then, drag the period (time) dimension to ‘Columns’ part of sheet and then the ‘last closing price’ measure to the rows part of the worksheet. You should see something like the graph shown in Fig 18 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/graph-before-time-adj.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 18: Initial graph after dragging dimensions and measures in Tableau</span></figcaption></figure><p>Now this graph doesn’t quite have the level of fidelity we’re looking for because the data points are being grouped by year. To fix this, click on the drop down arrow on period and select ‘exact date’. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/how-to-adjust.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 19: Finding the exact date setting on a dimension in Tableau</span></figcaption></figure><p>This undoes the grouping by year and matches each price data point to the exact date that price occurred on. Your graph should now look like Fig 20 below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/graph-after-time-adjustment.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 20: Graph after selecting correct setting for ‘period’ in Tableau</span></figcaption></figure><p>From there you can edit axis labels and colors and even add filters to zoom in on a specific time period. Here’s our final result, with labels added, in Fig 21 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/graph-after-final-editing.png" class="kg-image" alt="" loading="lazy"><figcaption><span style="white-space: pre-wrap;">Fig 21: Final graph showing Bitcoin prices in USD from 2010-2019</span></figcaption></figure><p>We encourage you to explore different visualization formats for the results you obtain from the queries provided. For inspiration, check out the <a href="https://timescale.ghost.io/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb">different visualizations we used in our analysis</a>.</p><h2 id="conclusion">Conclusion</h2><p>This tutorial showed you, step by step, one method of defining, creating, loading and analyzing a cryptocurrency market dataset. 
Here’s a reminder of what we covered:<br></p><ol><li>We created a Timescale Cloud account and spun up a TimescaleDB instance.</li><li>We learned how to design a schema for cryptocurrency data for TimescaleDB and PostgreSQL.</li><li>We used the CryptoCompareAPI and Python to create a CSV file containing the data to analyze</li><li>We inserted the data from the CSV files created into TimescaleDB using pgAdmin and Timescale Cloud.</li><li>We connected our data in TimescaleDB to Tableau and performed queries on the dataset</li><li>We used Tableau to create graphs to visualize the results from our queries</li></ol><p>Thank you for reading this far and if you followed all the steps, congratulations on successfully completing this tutorial! We hope you’ve enjoyed following along and that you’ve found this tutorial helpful. </p><p>Perhaps you’d also enjoy our <a href="https://timescale.ghost.io/blog/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb/">analysis of over 4100 cryptocurrencies produced by following this tutorial</a>.</p><p>For follow up questions or comments, reach out to us on Twitter (<a href="https://twitter.com/timescaledb">@TimescaleDB</a> or <a href="https://twitter.com/avthars">@avthars</a>), our community <a href="http://slack.timescale.com/">Slack channel</a>, or reach out to me directly via email (avthar at timescale dot com).</p><p>Finally, if you’re interested in learning more about us, check out the <a href="https://www.timescale.com/">Timescale website</a>, <a href="https://www.timescale.com/cloud-signup">Timescale Cloud</a>, and see our <a href="https://github.com/timescale/timescaledb">GitHub</a> and let us know how we can help!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Analyzing Bitcoin, Ethereum, and 4,100+ Other Cryptocurrencies Using PostgreSQL and TimescaleDB]]></title>
            <description><![CDATA[After a Crypto Winter in 2018, cryptocurrencies today are resurging. How can data help us better understand the Crypto Revival?]]></description>
            <link>https://www.tigerdata.com/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/analyzing-bitcoin-ethereum-and-4100-other-cryptocurrencies-using-postgresql-and-timescaledb</guid>
            <category><![CDATA[Product & Engineering]]></category>
            <category><![CDATA[Tutorials]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <dc:creator><![CDATA[Avthar Sewrathan]]></dc:creator>
            <pubDate>Thu, 19 Sep 2019 19:55:07 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2019/09/andre-francois-mckenzie-JrjhtBJ-pGU-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<p><strong><em>After a Crypto Winter in 2018, cryptocurrencies today are resurging. How can data help us better understand the Crypto Revival?</em></strong></p><p>When Satoshi Nakamoto first published the Bitcoin whitepaper in 2008, they probably didn’t foresee the world of <a href="https://www.investopedia.com/terms/h/hodl.asp">hodlers</a>, <a href="https://decryptionary.com/dictionary/lambo/">lambos</a>, <a href="https://www.definitions.net/definition/Buidl">buidlers</a>, <a href="https://www.vice.com/en_us/article/ne74nw/inside-the-world-of-the-bitcoin-carnivores">bitcoin maximalist carnivores</a>, and n00bs asking “<a href="https://www.quora.com/What-does-going-to-the-moon-mean-in-cryptocurrency">wen moon</a>” in Telegram channels that their actions would create. In 11 years, crypto has gone from something completely esoteric to something seemingly everyone has heard about.</p><p>2019 has been a big year for crypto so far. Some of the highlights include the <a href="https://www.wsj.com/articles/sec-clears-blockstack-to-hold-first-regulated-token-offering-11562794848">SEC approving its first token sale</a>, the <a href="https://www.creditkarma.com/insights/i/irs-crack-down-cryptocurrency-owners/">IRS tracking down crypto tax-evaders</a>, <a href="https://www.thetradenewscrypto.com/vast-majority-endowment-funds-testing-crypto-investments/">university endowments investing in crypto</a>, and even <a href="https://techcrunch.com/2019/06/18/facebook-libra/">Facebook announcing its own cryptocurrency</a>. We also just this month saw <a href="https://news.ycombinator.com/item?id=20919958">over $1 billion in Bitcoin transferred in a single transaction</a>. 
All this, and more, indicates a revived interest in the crypto markets since the highs of 2017 and lows of 2018, by everyone from institutional investors and banks to lay people trying to side hustle.</p><p>With the crypto markets once again awash with speculation and hype, it’s important to leverage all the tools at our disposal in order to make sense of the noise. Sometimes, reading articles and email newsletters isn’t enough. You have to go directly to the data.</p><p>As the developers of <a href="https://www.timescale.com/">TimescaleDB</a>, an open-source time-series database powered by PostgreSQL, we’re data-driven people. So, we thought it would be interesting to take a <em>data-driven approach</em> to analyzing the crypto market. For this analysis, we used PostgreSQL and TimescaleDB to analyze market data about Bitcoin, Ethereum, and 4,196 other cryptocurrencies, and used Tableau to visualize our results.</p><p><strong>This post shares many high-level insights about the crypto market since its inception and during recent years. 
We answer questions like:</strong><br></p><ul><li>How has the price of Bitcoin and Ethereum changed in the past several years?</li><li>Which new cryptocurrencies have been the most profitable in the past 3 months?</li><li>What are the cryptocurrencies on the rise?</li><li>What was the best day to “day-trade” Bitcoin?</li><li>What countries have the highest trading volume of BTC today?</li><li>Why is Bitcoin a terrible way to pay for pizza?</li></ul><p>...and many more, as we dive into analysis of topics like Bitcoin and Ethereum price, new coin growth, trading volume and daily returns.</p><p>We also share how powerful SQL is as a query language for analyzing time-series data, how TimescaleDB and PostgreSQL further simplify time-series data analysis, and how using the two with Tableau visualizations can surface interesting insights from your data.</p><p>For the technically curious, you can learn how to create the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset</a> we used for this analysis, load it, and draw insights from it, in this <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">companion tutorial post</a>. 
In the tutorial, you will find <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">step by step instructions</a> on how to <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">create the dataset using Python</a> (including all code we used for the analysis), how to <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">load the data</a> into <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>, a cloud-hosted version of TimescaleDB, and how to connect your database in the cloud to Tableau to <a href="https://timescale.ghost.io/blog/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">recreate the analysis and produce graphs</a>.</p><h2 id="about-the-data-used-for-this-analysis">About the data used for this analysis</h2><p>For this analysis, we used historical <a href="https://en.wikipedia.org/wiki/Open-high-low-close_chart">OHLCV</a> price data for over <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">4100 cryptocurrencies</a> from 7/17/2010 to 9/16/2019, courtesy of <a href="https://www.cryptocompare.com/">CryptoCompare</a> and their <a href="https://min-api.cryptocompare.com/">wonderful API</a>. 
While the <a href="https://github.com/timescale/examples/tree/master/crypto_tutorial/Cryptocurrency%20dataset%20Sept%2016%202019">dataset we used</a> only includes daily data, TimescaleDB <a href="https://timescale.ghost.io/blog/recap-performant-time-series-data-management-analytics-with-postgres/">easily scales to handle data from much finer grained time periods</a>.</p><p>Some of you may recall that we did a <a href="https://timescale.ghost.io/blog/analyzing-ethereum-bitcoin-and-1200-cryptocurrencies-using-postgresql-3958b3662e51/">similar analysis on crypto back in 2017</a>, but so much has happened since then, including the addition of almost 3000 more cryptocurrencies and Bitcoin hitting nearly ~$20k in price, that we had to revisit this topic. Where applicable, we’ve included graphs to focus on recent history, from 2017-2019, updating the analysis from our previous post.</p><p><em>DISCLAIMER: At Timescale, we help companies harness the power of time series data to make sense of the past, monitor the present, and predict the future. However, nothing in this analysis should be construed as financial advice and we take no liability for your actions as a result of using the information contained in this post. You’re welcome to draw your own conclusions using the tools and data and take your own risks accordingly.</em></p><h2 id="so-if-you%E2%80%99d-invested-100-in-bitcoin-9-years-ago-today-it%E2%80%99d-be-worth%E2%80%A6">So if you’d invested $100 in Bitcoin 9 years ago, today it’d be worth…</h2><p>When analyzing cryptocurrencies, we have to start with the original: Bitcoin. For any beginners, Bitcoin <a href="https://www.luno.com/learn/en/article/bitcoin-as-digital-gold">can be thought of as digital gold</a>, because Bitcoin has built-in scarcity (only 21 million BTC will ever be produced), can be almost infinitely divided without losing its unit value, and is difficult to counterfeit. 
(As an aside, for those looking for an introduction to Bitcoin and other cryptocurrencies, two good places to start are <a href="https://www.lopp.net/pdf/princeton_bitcoin_book.pdf">The Princeton Bitcoin Book</a> and <a href="https://www.amazon.com/Internet-Money-Andreas-M-Antonopoulos/dp/1537000454">The Internet of Money</a>.)</p><p>Looking at historical BTC-USD prices since 2010, we see that BTC prices have slowly increased, with an almost exponential increase taking place between 2014 and 2018. </p><pre><code class="language-SQL">--Query 1
-- BTC 7 day prices
SELECT time_bucket('7 days', time) as period,      
	last(closing_price, time) AS last_closing_price
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q1.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 1: Bitcoin Closing price in USD from 2010-2019</strong></b></i></figcaption></figure><p>So to answer our original question (and probably create some FOMO), if you bought $100 worth of Bitcoin on 16 September 2010, the price of 1 BTC was $0.0619, meaning $100 would have bought you 1615.5088853 BTC. Fast forward 9 years, and that $100 would have grown to $16,476,736.67! Cue the lambos! (But of course you would probably have sold when BTC hit $100 back on 29 July 2013 :)).</p><p><em>Generating insights like this from time-series data takes no more than a 5-line SQL query thanks to TimescaleDB’s </em><a href="https://docs.timescale.com/latest/api#time_bucket"><em>time_bucket</em></a><em> and </em><a href="https://docs.timescale.com/latest/api#last"><em>last</em></a><em> functions, which were created specifically for TimescaleDB to simplify time-series analysis.</em></p><h2 id="why-bitcoin-is-a-terrible-way-to-pay-for-pizza">Why Bitcoin is a terrible way to pay for pizza</h2><p>One thing that’s evident from the data is that Bitcoin is volatile. Bitcoin’s price volatility may mean that its main use case is being a store of value rather than a means of exchange. 
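</p><p><em>As an aside, one way to put a number on this volatility (a sketch, assuming the same btc_prices table used above, and not part of the original analysis) is to compare the standard deviation of daily closing prices to their average within 30-day buckets:</em></p><pre><code class="language-SQL">-- Sketch: 30-day BTC price volatility (assumes the btc_prices table above)
SELECT time_bucket('30 days', time) AS period,
	stddev(closing_price) AS price_stddev,
	avg(closing_price) AS price_avg
FROM btc_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period;</code></pre><p>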
The <a href="https://finance.yahoo.com/news/bitcoin-pizza-day-celebrating-world-100035402.html">Bitcoin Pizza Guy</a>, who paid 10,000 BTC for 2 pizzas back in 2010, would probably agree, as those 10,000 BTC are worth over $100,000,000 today!</p><p>Moreover, the time period between 2017 and 2019 saw a lot of ups and downs, so let’s zoom in on that period below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q1-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 2: Bitcoin Closing price in USD from July 2017 to July 2019</strong></b></i></figcaption></figure><p>As Figure 2 shows, 2017-2019 was arguably the most exciting, and perhaps also the most painful, time in the Bitcoin market, with Bitcoin prices reaching a high of nearly $20,000 before crashing to under $7,000 three months later. That decline continued throughout 2018, with some calling it the first crypto bear market and others a “Crypto Winter”. However, Bitcoin’s recent price run since April 2019 may be the first indication of the end of the Crypto Winter, with BTC prices reaching over $12,000 in June 2019.</p><h2 id="the-best-day-to-%E2%80%9Cday-trade%E2%80%9D-bitcoin-february-26-2014">The best day to “day-trade” Bitcoin? February 26, 2014.</h2><p>In order to better understand Bitcoin’s volatility, let’s look at the daily return on BTC, as a factor of the previous day’s rate. This is simple to do using <a href="https://www.postgresql.org/docs/9.3/functions-window.html">PostgreSQL window functions</a>, as shown in the query below.</p><pre><code class="language-SQL">--Query 2
-- BTC daily return
SELECT time,
	closing_price / lead(closing_price) over prices AS daily_factor
FROM (
	SELECT time,
		closing_price
	FROM btc_prices
	WHERE currency_code = 'USD'
	GROUP BY 1, 2
) sub window prices AS (ORDER BY time DESC)</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q2.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 3.1: BTC Daily Return from 2010-2019</strong></b></i></figcaption></figure><p>From Figure 3.1, we can see large amounts of volatility in BTC price from 2010, culminating in a huge spike in daily return factor in February 2014. In fact, within just 7 days we saw the day with the lowest daily return factor — 0.428 on 20 February 2014 — and the highest ever daily return factor — 4.368 on 26 February 2014! It’s no wonder that <a href="https://www.washingtonpost.com/business/why-facebook-chose-stablecoins-as-its-path-to-crypto/2019/06/18/2fa7d738-91e7-11e9-956a-88c291ab5c38_story.html?noredirect=on">stablecoins</a>, or price-stable cryptocurrencies are being looked into as alternatives to use as a means of exchange within apps or for everyday transactions. Imagine paying for everything you buy in Bitcoin and the headaches that daily and weekly BTC price volatility causes.</p><p>Once again, let’s zoom in on the 2017-2019 time period:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q2-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 3.2: BTC Daily Return from 2017-2019</strong></b></i></figcaption></figure><p>From Figure 3.2, we see that Bitcoin’s most profitable day since the start of 2017 occurred recently on 25 August 2019, with a daily return of 1.99 times the previous day’s rate. 
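</p><p><em>A sketch of how the specific dates above can be found (this query isn’t shown in the original post): wrap Query 2 in a subquery and sort by the daily return factor; sorting ascending instead would surface the biggest losses.</em></p><pre><code class="language-SQL">-- Sketch: BTC's most profitable days since 2017, built on Query 2
SELECT time, daily_factor
FROM (
	SELECT time,
		closing_price / lead(closing_price) OVER (ORDER BY time DESC) AS daily_factor
	FROM btc_prices
	WHERE currency_code = 'USD'
) sub
WHERE time &gt;= '2017-01-01'
ORDER BY daily_factor DESC NULLS LAST
LIMIT 5;</code></pre><p>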
The day with the biggest loss was 16 January 2018, with a daily return factor of 0.8276, and 14 September 2017 came in second, with a daily return factor of 0.8379.</p><h2 id="bitcoin%E2%80%99s-top-countries-by-trading-volume-us-japan-south-korea-and-poland">Bitcoin’s top countries by trading volume: US, Japan, South Korea, and... Poland!?</h2><p>Crypto fever has taken the world by storm, with lots of adoption taking place outside of the USA, most notably in Europe and Asia. We can get a sense of how crypto is being adopted in different regions of the world by looking at Bitcoin trading volume in different fiat currencies over time.</p><pre><code class="language-SQL">--Query 3
-- BTC trading volumes by currency 
SELECT time_bucket('14 days', time) as period,
	currency_code,       
    	sum(volume_btc)
FROM btc_prices
GROUP BY currency_code, period
ORDER BY period;</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 4: BTC trading volume in different fiat currencies</strong></b></i></figcaption></figure><p>From Figure 4 above, we can see that China saw huge amounts of bitcoin trading volume before government intervention made buying Bitcoin illegal in mid-2017. We can more clearly see how drastic this effect was by looking at Figure 5 below, which is a bar graph version of Figure 4, showing the volume of BTC traded in different fiat currencies by year:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3--2-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 5: BTC Trading Volume in different fiat currencies by year</strong></b></i></figcaption></figure><p>From Figure 5, we can more clearly see the rise in Chinese (CNY) Bitcoin trading activity and how government intervention in 2017 brought that to a halt. Moreover, we can see Japan (JPY) and South Korea (KRW) overtaking Europe with respect to Bitcoin trading volume, with more volume than the Euro (EUR) during 2017 and 2018. 
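</p><p><em>The yearly totals behind Figure 5 can be produced with a small variant of Query 3; this sketch (not shown in the original post) buckets by calendar year instead of 14-day periods:</em></p><pre><code class="language-SQL">-- Sketch: BTC trading volume per fiat currency by calendar year
SELECT date_trunc('year', time) AS year,
	currency_code,
	sum(volume_btc) AS yearly_volume_btc
FROM btc_prices
GROUP BY year, currency_code
ORDER BY year, yearly_volume_btc DESC;</code></pre><p>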
This confirms the <a href="https://www.bbc.com/news/business-42713314">USA, Japan and South Korea as the world’s 3 largest bitcoin markets</a>.</p><p>Furthermore, if we remove USD, CNY, JPY, KRW and EUR from the list of fiat currencies, we can get a sense of the trend in Bitcoin adoption outside the largest markets, as shown in Figure 6 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3--5-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 6: BTC Trading Volume in different fiat currencies by year (excluding USD, JPY, KRW, CNY, EUR)</strong></b></i></figcaption></figure><p>Note how South Africa (ZAR) has seen rising BTC trade volumes since 2015, the dramatic increase in trading volume from Hong Kong (HKD) in 2017 and subsequent decrease, and that the currency with the highest trade volume outside the big 5 is none other than the Polish Zloty (PLN)!</p><p>Figure 6.1 shows trading volumes of BTC in PLN since 2014:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q3--6-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 6.1: BTC trading volumes in PLN</strong></b></i></figcaption></figure><h2 id="now-if-you%E2%80%99d-bought-100-worth-of-ethereum-in-2015-today-it%E2%80%99d-be-worth">Now if you’d bought $100 worth of Ethereum in 2015, today it’d be worth...</h2><p>Ethereum is popularly regarded as the cryptocurrency with the second-largest interest base after Bitcoin; however, it is fundamentally different. 
While Bitcoin is considered to be digital gold, Ether is more like fuel (<a href="https://ethereum.stackexchange.com/questions/3/what-is-meant-by-the-term-gas">gas</a>) that runs transactions on the Ethereum network.</p><p>Since the <a href="https://cointelegraph.com/news/ethereum-raises-3700-btc-in-first-12-hours-of-ether-presale">first currency with which you could buy Ethereum was Bitcoin</a>, let’s take a look at historical ETH prices in BTC, shown in Figure 7 below:</p><pre><code class="language-SQL">-- Q4
-- ETH prices in BTC in 7 day intervals
SELECT  
	time_bucket('7 days', time) AS time_period,   
    last(closing_price, time) AS closing_price_btc
FROM crypto_prices
WHERE currency_code='ETH'
GROUP BY time_period
ORDER BY time_period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q4.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 7: ETH Price in BTC since 2015</strong></b></i></figcaption></figure><p>Figure 7 shows us Ethereum (ETH) closing prices since 3 August 2015 in weekly intervals, expressed in BTC. Notice that 2017 was a rollercoaster year for ETH, with the currency seeing its all-time high of 0.1402 BTC on 12 June 2017 and then crashing back down to 0.0288 BTC on 4 December 2017, less than 6 months later. </p><p>Let’s take a look at recent Ethereum prices by zooming in on the period since 2017 in Figure 8 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q4-Zoom.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 8: ETH Price in BTC since 1 January 2017</strong></b></i></figcaption></figure><p>From Figure 8 above, it seems that ETH then went on another bull run in early 2018, with prices reaching 0.1052 BTC in the period around 22 January 2018. Since then, it seems like ETH prices have been trending downward, with the price on 16 September 2019 reaching 0.0188 BTC. While that’s not great for investors, it may prove to be a blessing for developers and decentralized application users in Ethereum’s ecosystem, as gas costs would be cheaper, perhaps decreasing the barriers to adoption.</p><h2 id="crypto-convertibles-not-the-car-kind">Crypto Convertibles (not the car kind)</h2><p>Since most people don’t think about prices in BTC (yet), and given how volatile BTC is, it’s useful to also examine ETH prices expressed in different fiat currencies. 
There are two ways this could be done: first, by looking at ETH prices directly in different fiat currencies, as we did for Bitcoin and USD in Figure 1 above; second, by converting ETH prices in BTC to fiat currency prices, using that day’s BTC-to-fiat exchange rate. For illustration purposes, we’ll use the second technique. While this may seem like a strange choice, it’s worth noting that conversions from one cryptocurrency to another and then to a fiat currency are fairly common in the cryptocurrency trading world. This is because many exchanges support buying cryptocurrencies with other cryptocurrencies (mainly BTC and ETH), but not all cryptocurrencies are purchasable directly with fiat currency.</p><p>In order to examine ETH prices in different fiat currencies in PostgreSQL, we joined two tables and used filters, as the code below illustrates:</p><pre><code class="language-SQL">--Q5
-- ETH prices in fiat
SELECT time_bucket('7 days', c.time) AS time_period,  
	last(c.closing_price, c.time) AS last_closing_price_in_btc,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'USD') AS last_closing_price_in_usd, 
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'EUR') AS last_closing_price_in_eur,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'CNY') AS last_closing_price_in_cny,
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'JPY') AS last_closing_price_in_jpy,  
    last(c.closing_price, c.time) * last(b.closing_price, c.time) FILTER (WHERE b.currency_code = 'KRW') AS last_closing_price_in_krw
FROM crypto_prices c
JOIN btc_prices b   
	ON time_bucket('1 day', c.time) = time_bucket('1 day', b.time)
WHERE c.currency_code = 'ETH'
GROUP BY time_period
ORDER BY time_period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-USD-EUR.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.1: ETH Price in USD and EUR</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-JPY.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.2: ETH Price in JPY</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-CNY.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.3: ETH Price in CNY</strong></b></i></figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5-KRW.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 9.4: ETH Price in KRW</strong></b></i></figcaption></figure><p>One thing to notice is that Figures 9.1-9.4 all have the same shape, since they are another expression of the ETH price in BTC. The main difference in the graphs is the scale of the Y axis, as this reflects the respective currency’s BTC exchange rate. This is a result of the decision to use the BTC-fiat exchange rate for this conversion, rather than direct ETH-fiat prices. 
When plotted all on the same axis, we get Figure 10 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q5.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 10: ETH Price in Different Fiat Currencies</strong></b></i></figcaption></figure><p>Fortunately, you can now directly purchase ETH using fiat currency on many exchanges. So let’s look at historical ETH prices in USD, in Figure 11 below:</p><pre><code class="language-SQL">-- ETH prices in USD
SELECT time_bucket('7 days', time) as period,       
	last(closing_price, time) AS last_closing_price
FROM eth_prices
WHERE currency_code = 'USD'
GROUP BY period
ORDER BY period</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q10-ETH-USD.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 11: ETH Price in USD from 2015-2019</strong></b></i></figcaption></figure><p>Figure 11 above tells a similar story to that of Figure 8, as we can see the ETH bull run to $1,359 on 8 January 2018. ETH prices in USD have been trending downward since then, with the price on 16 September 2019 falling to $199.</p><p>So to answer our original question, if you bought $100 worth of Ethereum on 16 September 2015, the price of 1 ETH was around $0.89, meaning $100 would have bought you 112.3596 ETH. That $100 would now have grown to $22,359.55! However, the best time to sell would’ve been during the January 2018 peak, when your 112.3596 ETH would’ve been worth $152,696.63! This just goes to show how important it is to time the market!</p><p><em>(One piece of analysis which we didn’t do, but encourage readers to do, is to examine the developer activity in the Ethereum ecosystem (e.g., GitHub commits/issues) and see if it correlates to price in some way.)</em></p><h2 id="tracking-4000-other-cryptocurrencies-starting-from-inception">Tracking 4000+ other cryptocurrencies, starting from inception</h2><p>While we can’t see when exactly coins ICO’d or first got listed on exchanges, we <em>can</em> track the date a coin first got added to CryptoCompare as a proxy for its launch date. This allows us to track the launch of different coins over time, as seen in Figure 11.</p><pre><code class="language-SQL">--Q6
--Crypto by date of first data
SELECT ci.currency_code, min(c.time)
FROM currency_info ci JOIN crypto_prices c ON ci.currency_code = c.currency_code
AND c.closing_price &gt; 0
GROUP BY ci.currency_code
ORDER BY min(c.time) DESC</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q6--Bar-Graph-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 11: Number of new cryptocurrencies launched per year</strong></b></i></figcaption></figure><p>It’s easy to conclude that a bull run in BTC prices might have fueled a massive increase in developer activity in the crypto space. The evidence for this comes from the 738 new cryptos released in 2017, a year in which BTC almost 10x’ed in price between January (1 BTC = $2,435 on Jan 1) and December (1 BTC = $19,345 on Dec 16). With bitcoin prices steadily increasing, it’s no wonder that hundreds of developers tried their hand at creating their own cryptocurrency in the hopes of maybe building the next Bitcoin.</p><p>However, despite the dramatic crash in BTC prices throughout most of 2018, it’s interesting to see the number of new cryptocurrencies launched in 2018 being the highest ever, with 771 new cryptocurrency projects launching that year.</p><p>Figure 12 shows us the result of examining the number of new cryptos launched by day.</p><pre><code class="language-SQL">--Q7
-- Number of new cryptocurrencies by day
SELECT day, COUNT(code)
FROM (
	SELECT min(c.time) AS day, ci.currency_code AS code
	FROM currency_info ci
	JOIN crypto_prices c ON ci.currency_code = c.currency_code
		AND c.closing_price &gt; 0
	GROUP BY ci.currency_code
	ORDER BY min(c.time)
) a
GROUP BY day
ORDER BY day DESC</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q7.png" class="kg-image" alt="" loading="lazy" width="1600" height="1092" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Q7.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Q7.png 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Q7.png 1600w" sizes="(min-width: 720px) 720px"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 12: Number of new cryptocurrencies launched by day</strong></b></i></figcaption></figure><p>Two days in particular are super interesting. On 2 December 2014, we saw data for 81 cryptocurrencies being added to CryptoCompare, and on 26 May 2017, we saw a whopping 134 new cryptos added in a single day!</p><h2 id="these-coins-had-higher-transaction-volumes-than-bitcoin-this-month-hint-it%E2%80%99s-not-bitcoin-cash">These coins had higher transaction volumes than Bitcoin this month (Hint: It’s not Bitcoin Cash)</h2><p>With over 4,000 cryptocurrencies out there and new ones coming out every day, it can be hard to pick which ones are worth paying attention to. One helpful metric for spotting coins on the rise is the transaction volume. In Figure 13, we looked at the transaction volume for all 4054 coins in our dataset over the 14-day period from September 2 to 16, 2019.</p><pre><code class="language-SQL">--Q8
--Crypto transaction volume during a certain time period
SELECT 'BTC' as currency_code,      
	sum(b.volume_currency) as total_volume_in_usd
FROM btc_prices b
WHERE b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
GROUP BY b.currency_code
UNION
SELECT c.currency_code as currency_code,      
	sum(c.volume_btc) * avg(b.closing_price) as total_volume_in_usd
FROM crypto_prices c JOIN btc_prices b ON date(c.time) = date(b.time)
WHERE c.volume_btc &gt; 0
AND b.currency_code = 'USD'
AND now() - date(b.time) &lt; INTERVAL '14 day'
AND now() - date(c.time) &lt; INTERVAL '14 day'
GROUP BY c.currency_code
ORDER BY total_volume_in_usd DESC </code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q8.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 13: Cryptos with the most USD transaction volume from Sept 2-16, 2019</strong></b></i></figcaption></figure><p>It’s surprising to see that Bitcoin, despite being the original cryptocurrency, did not have the largest transaction volume over the time period in question. That honor belonged to <a href="https://en.wikipedia.org/wiki/Tether_(cryptocurrency)">USD Tether</a> (USDT), a fiat-backed stablecoin, with Ethereum coming in second and <a href="https://litecoin.org/">Litecoin</a>, the proverbial silver to Bitcoin’s gold, coming in third. Bitcoin (BTC) had the fourth-highest USD transaction volume in that 14-day period, followed by <a href="https://www.ripple.com/">Ripple</a> (XRP), a global payments system which has partnered with several banks and payment processors, and <a href="https://en.wikipedia.org/wiki/EOS.IO">EOS</a>, a smart contract platform.</p><h2 id="what-are-the-most-profitable-new-cryptocurrencies">What are the most profitable new cryptocurrencies?</h2><p>Another way of making sense of the flood of new currencies is to look at how profitable coins are, as measured by total daily return. By homing in on the currencies with the highest increase in rate by day, we can gain a different perspective on which currencies might be worth looking into further.</p><p>One question to ask is which cryptocurrency has the highest daily return during a certain time period. Figures 14.1 and 14.2 below show cryptocurrencies sorted by their maximum daily return factor between June 16 and September 16, 2019.<br></p><pre><code class="language-SQL">--Q9
--Top crypto by daily return
WITH   
	prev_day_closing AS (
SELECT   
	currency_code,   
    time,   
    closing_price,   
    LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY TIME DESC) AS prev_day_closing_price
FROM    
	crypto_prices  
),   
	daily_factor AS (
SELECT   
	currency_code,   
	time,   
	CASE WHEN prev_day_closing_price = 0 THEN 0 ELSE closing_price/prev_day_closing_price END AS daily_factor
FROM   
	prev_day_closing
)
SELECT   
	time,   
	LAST(currency_code, daily_factor) as currency_code,   
    MAX(daily_factor) as max_daily_factor
FROM   
	daily_factor
GROUP BY   
time</code></pre><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q9--2-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 14.1: Cryptocurrencies sorted by absolute highest daily return</strong></b></i></figcaption></figure><p>From Figure 14.1 above, we see that <a href="https://www.cryptocompare.com/coins/mixi/overview">Mixin (MIXI)</a>, an Ethereum-based token, comes out on top with a maximum daily return factor of over 25 million times the previous day’s rate. <a href="https://www.cryptocompare.com/coins/bomb/overview">BOMB</a>, an experimental deflationary currency, had the second-highest absolute daily return with a return factor of over 700,000. We can get a better look at the rest of the data by removing these two outliers, as seen in Figure 14.2 below:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q9--3-.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 14.2: Cryptos sorted by absolute highest daily return (with MIXI, BOMB removed)</strong></b></i></figcaption></figure><p>From Figure 14.2, we see that other cryptos with big days in the past 3 months were <a href="https://www.cryptocompare.com/coins/xde2/overview">Double Eagle Coin (XDE2)</a>, <a href="https://www.cryptocompare.com/coins/xgr/overview">Gold Reserve (XGR)</a>, <a href="https://www.cryptocompare.com/coins/soul/overview">SoulCoin (SOUL)</a> and <a href="https://www.cryptocompare.com/coins/ccc/overview">CCCoin (CCC)</a>. 
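</p><p><em>The frequency view in Figure 15 further below (the number of days each coin had the top daily return) can be sketched by extending Query 9’s CTEs; this query is not shown in the original post:</em></p><pre><code class="language-SQL">-- Sketch: days with the top daily return per coin, building on Query 9
WITH prev_day_closing AS (
	SELECT currency_code, time, closing_price,
		LEAD(closing_price) OVER (PARTITION BY currency_code ORDER BY time DESC) AS prev_day_closing_price
	FROM crypto_prices
),
daily_factor AS (
	SELECT currency_code, time,
		CASE WHEN prev_day_closing_price = 0 THEN 0
			ELSE closing_price / prev_day_closing_price END AS daily_factor
	FROM prev_day_closing
),
top_per_day AS (
	SELECT time, LAST(currency_code, daily_factor) AS currency_code
	FROM daily_factor
	GROUP BY time
)
SELECT currency_code, COUNT(*) AS days_with_top_return
FROM top_per_day
WHERE time BETWEEN '2019-06-16' AND '2019-09-16'
GROUP BY currency_code
ORDER BY days_with_top_return DESC;</code></pre><p><em>Swapping last() for TimescaleDB’s first() in the top_per_day step would count each coin’s worst days instead.</em>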
</p><p>Furthermore, it’s interesting to note the difference in order of magnitude between the coins with the top daily return in the past 3 months, with MIXI achieving a daily return factor in the millions, BOMB in the hundreds of thousands, and the next-highest coins in the tens of thousands. </p><p>Another interesting thing to look at is the number of times a coin had the top daily return during a certain period of time. Figure 15 shows us the coins with the highest frequency of having the top daily return during the 3 months between 16 June and 16 September 2019.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Q9.png" class="kg-image" alt="" loading="lazy"><figcaption><i><b><strong class="italic" style="white-space: pre-wrap;">Fig 15: Top coins by most days with highest daily return during a 3 month period</strong></b></i></figcaption></figure><p>The coins with the highest frequency of having the top daily return are <a href="https://www.cryptocompare.com/coins/mixi/overview">MIXI</a> (discussed above), with 5 unique days with the top daily return; <a href="https://www.coinbase.com/price/bitether">Bitether (BTR)</a>, a cryptocurrency built on the Ethereum platform, with 3 unique days; and <a href="https://coinmarketcap.com/currencies/icechain/">IceChain (ICHX)</a>, coming in third with 2 unique days. </p><h2 id="conclusion">Conclusion</h2><p>In this post, we used the power of PostgreSQL and TimescaleDB to analyze a public cryptocurrency dataset of over 4100 cryptocurrencies over the period 2010 to 2019. 
We examined time-series trends in Bitcoin and Ethereum prices, new coin growth, trading volume, daily returns, and more.</p><p>While our analysis aimed to provide a taste of what’s possible using PostgreSQL and TimescaleDB, we encourage you to take the tools we used and apply them to different crypto datasets to gain deeper insights!</p><p>We’ve created a <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">companion tutorial post</a> for those interested in re-creating this analysis or looking for a starting point for their own analysis. In the tutorial, you will find <a href="https://timescale.ghost.io/blog/tutorials/how-to-analyze-cryptocurrency-market-data-using-timescaledb-postgresql-and-tableau-a-step-by-step-tutorial/">step-by-step instructions</a> on how to create the dataset using Python (including all code we used for the analysis), how to load the data into <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>, and how to connect your database in the cloud to Tableau to recreate the analysis and produce graphs. If you do perform your own analysis, let us know what interesting insights you find! </p><p>Moreover, you can dig into the technical side of Timescale and how we made PostgreSQL scalable for time-series data <a href="https://timescale.ghost.io/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/">in this detailed post</a>. If you’re interested in experiencing the power of Timescale for your time-series data, sign up for <a href="https://portal.managed.timescale.com/login">Managed Service for TimescaleDB</a>!</p><p>Please drop your comments below and share this post with others whom you think would enjoy it. 
For follow-up questions or comments, reach out to us on Twitter (<a href="https://twitter.com/timescaledb">@TimescaleDB</a> or <a href="https://twitter.com/avthars">@avthars</a>), our community <a href="http://slack.timescale.com/">Slack channel</a>, or reach out to me directly via email (avthar at timescale dot com).</p>]]></content:encoded>
        </item>
    </channel>
</rss>