Jun 11, 2025
Modern applications need data that's both accessible and fast. You have data in S3, but transforming it into usable insights requires complex ETL (extract-transform-load) pipelines. With our new livesync for S3 and pgai Vectorizer features, TigerData (creators of TimescaleDB) transforms how you interact with S3 data.
Our new features offer distinct approaches to working with S3 data:
- Livesync for S3 continuously syncs CSV and Parquet files from your S3 bucket into Tiger Cloud, landing them in Postgres tables or TimescaleDB hypertables for real-time analytics.
- Pgai Vectorizer creates and maintains searchable vector embeddings in Postgres from documents stored in S3, while keeping the original files in place.
Both eliminate complex ETL pipelines, letting you work with S3 data using familiar SQL.
S3 is where countless organizations store their data, but Tiger Cloud is where they unlock insights. Livesync for S3 bridges this gap, eliminating the traditional complexity of moving data between these systems.
Data management challenges create significant obstacles when bridging S3 storage and analytics environments. Organizations struggle with the manual effort of moving data between S3 buckets and analytical databases, writing custom integration code that demands ongoing maintenance, and the ETL processes themselves tend to be brittle and resource-intensive.
Many organizations are also caught in a constant battle for data freshness, relying on vigilant monitoring to confirm that their analytics platform reflects the most current information in S3. These challenges often culminate in performance bottlenecks: inefficient data transfer delays up-to-date information from reaching customer-facing applications, leading to poor user experiences and decisions made on stale data.
We've engineered livesync for S3 to bring stream-like behavior to object storage, effectively turning your S3 bucket into a continuous data feed.
Our solution delivers speed and simplicity: connect a bucket once, and livesync does the heavy lifting behind the scenes to keep new files flowing into your database automatically.
This enables teams across multiple industries to build robust pipelines. For organizations running production applications on Postgres that want to scale their real-time analytics, livesync for S3 has a sister solution: livesync for Postgres. It lets you keep your Postgres database as-is while streaming data in real time to a Tiger Cloud instance optimized for analytical workloads.
Livesync employs a robust security model using AWS role assumption. Our service assumes a specific role in your AWS account with precisely the permissions needed to access your S3 data. To prevent confused deputy attacks, we implement the industry-standard External ID verification using your unique Project ID/Service ID combination.
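For illustration, the trust policy on that role looks something like the sketch below. This is a representative example, not the exact policy livesync generates; the account ID and external ID values are placeholders supplied by your livesync setup screen:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<tiger-cloud-account-id>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<project-id>/<service-id>" }
      }
    }
  ]
}
```

The `sts:ExternalId` condition is what blocks the confused deputy: even if another party tricks our service into assuming a role, the assumption fails unless it carries your unique Project ID/Service ID pair.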
Behind the scenes, livesync intelligently scans your S3 bucket using optimized ListObjectsV2 calls. Starting with the prefix from your pattern (like "logs/" from "logs/**/*.csv"), it applies glob matching to find relevant files. The system tracks processed files in lexicographical order, ensuring no file is missed or duplicated.
To maintain performance, livesync for S3 manages an orderly queue limited to 100 files per connection. When files are plentiful, polling accelerates to every minute; when caught up, it follows your configured schedule. You can always trigger immediate processing with the "Pull now" button.
Livesync handles each supported file format with specialized parsing techniques.
The entire pipeline includes intelligent error handling, which is clearly visible in the dashboard. After three consecutive failures, livesync automatically pauses to prevent resource waste, awaiting your review.
This architecture delivers the perfect balance of reliability, performance, and operational simplicity, bringing your S3 data into Tiger Cloud with minimal configuration and maximum confidence.
Build powerful ingest pipelines with minimal configuration.
Currently supporting CSV and Parquet file formats, livesync delivers a frictionless way to unlock the value of your data stored in S3.
Livesync for S3 continuously monitors your S3 bucket for incoming sensor data, automatically maps schemas, and syncs data into TimescaleDB hypertables in minutes. This enables operators to query millions of readings with millisecond latency, driving real-time dashboards that catch anomalies before equipment fails. Livesync for S3 ensures that syncing from S3 to hypertables remains smooth, dependable, and lightning-fast.
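As a sketch of what those dashboard queries can look like once readings land in a hypertable (the sensor_readings table and its columns here are hypothetical), TimescaleDB's time_bucket makes per-interval aggregation a one-liner:

```sql
-- Per-minute averages over the last hour, e.g. to drive a live dashboard.
-- Time partitioning on "ts" keeps this scan fast even at millions of rows.
SELECT
  time_bucket('1 minute', ts) AS minute,
  device_id,
  avg(temperature) AS avg_temp,
  max(vibration)   AS max_vibration
FROM sensor_readings
WHERE ts > now() - INTERVAL '1 hour'
GROUP BY minute, device_id
ORDER BY minute DESC;
```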
Setting up livesync for S3 is surprisingly straightforward: grant Tiger Cloud access to your bucket through an IAM role, point livesync at your files with a glob pattern, pick the destination table, and choose a sync schedule.
Behind the scenes, we're doing the heavy lifting of schema mapping, managing the initial data load, and maintaining continuous synchronization. The system intelligently tracks what it's processed, so you never have duplicate data or missed files.
For example, in manufacturing environments where sensors continuously capture critical equipment data through AWS IoT Core and store it in S3, livesync ensures this data becomes immediately queryable in TimescaleDB. This enables operators to identify anomalies before equipment fails, turning static S3 storage into actionable intelligence.
Once configured, livesync for S3 delivers ease and performance: files are detected and synced on your schedule, with no further intervention needed.
While livesync brings S3 data into Postgres, pgai Vectorizer takes a different approach for unstructured documents. It creates searchable vector embeddings in Postgres from documents stored in S3 while keeping the original files in place.
AI applications using RAG (retrieval-augmented generation) can help businesses unlock insights from mountains of unstructured data. Today, that unstructured data's natural home is Amazon S3. Meanwhile, Postgres has become the default vector database for developers, thanks to extensions like pgvector and pgvectorscale, which let them build intelligent applications with vector search without running a separate database just for vectors.
We've previously written about how vector databases are the wrong abstraction: they divorce embeddings from the unstructured source data they were generated from. This problem is especially apparent for documents housed in object storage like Amazon S3.
Before pgai Vectorizer, developers typically needed to manage:
- an external ETL pipeline to pull documents out of S3, parse them, chunk them, and generate embeddings
- a queuing system to track which documents are new, changed, or deleted
- custom logic to write the resulting embeddings back to Postgres and keep them in sync
Processing documents in AI pipelines introduces several challenges, such as managing diverse file formats (PDFs, DOCX, XLSX, HTML, and more), handling complex metadata, keeping embeddings up to date with document changes, and ensuring efficient storage and retrieval.
To solve these challenges, TigerData has added document vectorization support to pgai Vectorizer. It gives developers an automated way to create embeddings from documents in Amazon S3 and keep those embeddings synchronized as the underlying data changes, eliminating the need for external ETL pipelines and queuing systems.
Pgai Vectorizer provides a streamlined approach where developers can reference documents in S3 (or local storage) via URLs stored in a database table. The vectorizer then handles the complete workflow—downloading documents, parsing them to extract content, chunking text appropriately, and generating embeddings for use in semantic search, RAG, or agentic applications.
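As an illustrative sketch, here is roughly what that configuration looks like with pgai's SQL API. Exact function parameters vary by pgai version, and the table, column, bucket, and model names below are hypothetical:

```sql
-- A table of document references: each row points at an object in S3.
CREATE TABLE documents (
  id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  uri TEXT NOT NULL  -- e.g. 's3://my-bucket/contracts/acme.pdf'
);

-- Create a vectorizer that downloads each referenced document, parses and
-- chunks it, and keeps embeddings in sync as rows are added or updated.
SELECT ai.create_vectorizer(
  'documents'::regclass,
  loading     => ai.loading_uri(column_name => 'uri'),
  parsing     => ai.parsing_auto(),  -- detects PDF, DOCX, HTML, and so on
  chunking    => ai.chunking_recursive_character_text_splitter(),
  embedding   => ai.embedding_openai('text-embedding-3-small', 1536),
  destination => ai.destination_table('documents_embeddings')
);
```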
This integration supports a wide variety of file formats, including PDF, DOCX, XLSX, HTML, and more.
For developers, pgai Vectorizer for document vectorization offers three key benefits:
- Automation: the vectorizer handles the full download-parse-chunk-embed workflow, with no external pipeline to build or maintain.
- Synchronization: embeddings stay up to date automatically as documents in S3 are added, changed, or removed.
- Experimentation: it's simple to test different embedding models or chunking strategies for your specific domain.
By keeping your embeddings automatically synchronized to the source documents in S3, pgai Vectorizer ensures that your Postgres database remains the single source of truth for both your structured and vector data.
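Once the vectorizer has run, semantic search is plain SQL. A minimal sketch continuing the hypothetical names above (the query text and model are placeholders, and ai.openai_embed assumes an OpenAI API key is configured for pgai):

```sql
-- The vectorizer exposes each chunk and its embedding alongside the source
-- row, so results stay connected to the original S3 document.
SELECT uri, chunk,
       embedding <=> ai.openai_embed('text-embedding-3-small',
                                     'What is the refund policy?') AS distance
FROM documents_embeddings
ORDER BY distance
LIMIT 5;
```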
Pgai Vectorizer simplifies the entire document processing pipeline through a streamlined architecture that connects your Amazon S3 documents with Postgres. Here's how it works:
The pgai Vectorizer architecture for document vectorization consists of several key components:
- Data sources: Postgres and Amazon S3. A Postgres table holds one row per document, with a URI pointing at the object in S3 (or local storage).
- Vectorization configuration: a declarative SQL definition of how documents should be loaded, parsed, chunked, and embedded.
- Vectorizer worker (AWS Lambda): runs the configured pipeline, downloading documents and generating embeddings.
- Destination: a Postgres table that stores the chunks and their embeddings for querying.
The document vectorization process follows these steps:
1. The vectorizer detects new or changed rows in the source table referencing documents in S3.
2. It downloads each referenced document.
3. It parses the file to extract its text content.
4. It chunks the text appropriately.
5. It generates embeddings for each chunk and stores them in the destination table.
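You can watch the pipeline work from SQL as well; recent pgai versions expose a status view with one row per vectorizer, including a count of items still waiting to be embedded:

```sql
SELECT * FROM ai.vectorizer_status;
```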
Pgai Vectorizer's document vectorization capabilities enable several powerful use cases across industries by connecting S3-stored documents with Postgres vector search:
- Financial services: Automatically vectorize financial documents from S3 without custom pipelines. Connect document insights with quantitative metrics for unified queries.
- Legal: Maintain synchronized knowledge bases of legal documents with automatic embedding updates. Test different models for your specific domain.
- Customer support: Make knowledge base content immediately searchable as it changes, connecting support documents with customer data.
- Research: Build research AI with continuously updated paper collections, connecting published findings with experimental time-series data.
In each case, pgai Vectorizer eliminates infrastructure complexity while enabling continuous improvement through its "set it and forget it" synchronization and simple experimentation capabilities.
Livesync and pgai Vectorizer are just the first steps in our vision to unify Postgres and object storage into a single, powerful lakehouse-style architecture—built for real-time AI and analytics.
→ Sign up for Tiger Cloud and get started in seconds.
We can’t wait to see what you build.
Tomorrow, we'll reveal how TigerData delivers high-speed performance without sacrificing simplicity through SQL Assistant with Agent Mode, Recommendations, and Insights. See how plain-language queries eliminate SQL wrangling, how automated tuning keeps databases optimized with a single click, and why developers finally get both the millisecond response times users demand and the operational simplicity teams need.