<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[Tiger Data Blog]]></title>
        <description><![CDATA[Insights, product updates, and tips from TigerData (Creators of TimescaleDB) engineers on Postgres, time series & AI. IoT, crypto, and analytics tutorials & use cases.]]></description>
        <link>https://www.tigerdata.com/blog</link>
        <image>
            <url>https://www.tigerdata.com/icon.ico</url>
            <title>Tiger Data Blog</title>
            <link>https://www.tigerdata.com/blog</link>
        </image>
        <generator>RSS for Node</generator>
        <lastBuildDate>Tue, 07 Apr 2026 13:06:30 GMT</lastBuildDate>
        <atom:link href="https://www.tigerdata.com/blog" rel="self" type="application/rss+xml"/>
        <ttl>60</ttl>
        <item>
            <title><![CDATA[PostgreSQL Data Cleaning vs. Python Data Cleaning]]></title>
            <description><![CDATA[Are you using the best tools for your PostgreSQL data cleaning tasks? Here’s an introduction to some time-saving tools you can use within PostgreSQL itself. ]]></description>
            <link>https://www.tigerdata.com/blog/postgresql-data-cleaning-vs-python</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/postgresql-data-cleaning-vs-python</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Miranda Auhl]]></dc:creator>
            <pubDate>Thu, 23 May 2024 13:59:00 GMT</pubDate>
            <media:content medium="image" url="https://timescale.ghost.io/blog/content/images/2024/05/PostgreSQL-Data-Cleaning-vs.-Python-Data-Cleaning--1-.png">
            </media:content>
            <content:encoded><![CDATA[<h2 id="introduction"><br>Introduction</h2><p>During analysis, you rarely—if ever—get to go directly from evaluating data to transforming and analyzing it. Sometimes, to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning—and that’s a lot of cleaning! To accomplish all this work, you may use Excel, R, or Python, but are these the best tools for your PostgreSQL data cleaning tasks?</p><p>In this blog post, I explore some classic <strong>data cleaning</strong> scenarios and show how you can perform them <em>directly within your database</em> using <a href="https://www.timescale.com/">TimescaleDB</a> and <a href="https://www.postgresql.org/">PostgreSQL</a>, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL can't replace these tools entirely. However, they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  </p><p>Cleaning is a critical part of the analysis process. In my experience, it can be the most grueling! By cleaning data directly within my database, I can perform many of my cleaning tasks one time rather than repetitively within a script. 
This saves me considerable time in the long run.</p><div class="kg-card kg-callout-card kg-callout-card-purple"><div class="kg-callout-emoji">🔖</div><div class="kg-callout-text">Learn how you can perform <a href="https://www.timescale.com/learn/time-series-analysis-in-r" rel="noreferrer">time-series analysis in R</a>.</div></div><p><br></p><h2 id="how-postgresql-data-cleaning-fits-in-the-data-analysis-process">How PostgreSQL Data Cleaning Fits in the Data Analysis Process</h2><p>I began this series of posts on <a href="https://timescale.ghost.io/blog/blog/speeding-up-data-analysis/">data analysis</a> by presenting the following summary of the analysis process:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Untitled.jpg" class="kg-image" alt="Image showing Evaluate -> Clean -> Transform -> Model, accompanied by icons which relate to each step" loading="lazy" width="1600" height="406" srcset="https://www.timescale.com/blog/content/images/size/w600/2022/01/Untitled.jpg 600w, https://www.timescale.com/blog/content/images/size/w1000/2022/01/Untitled.jpg 1000w, https://www.timescale.com/blog/content/images/2022/01/Untitled.jpg 1600w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Data analysis lifecycle</span></figcaption></figure><p>The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, these being excellent options for analysis. </p><p>However, once I was introduced to PostgreSQL and TimescaleDB, I found how efficient and fast it was to do my data munging directly within my database. 
In my previous post, I focused on showing <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/">data evaluation</a> techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, <strong>data cleaning</strong>. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.</p><p>As I mentioned <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/">in my last post</a>, my first job out of college was at an energy and sustainability solutions company that focused on monitoring utility usage—such as electricity, water, sewage, you name it—to determine how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.</p><p>Throughout this job, I got the chance to use many popular data analysis tools, including Excel, R, and Python. But once I tried using a database, specifically PostgreSQL and TimescaleDB, to perform my data munging tasks, I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database.</p><p>Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script.</p><p>This meant that every time I ran my Python script, I had to wait while my machine spent computational time setting up and cleaning my data, losing time with every run. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script or pass it along to them to run.
This extra computational time could add up depending on the project.</p><p>Instead, with PostgreSQL data cleaning, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script; I could just set up the cleaning process in my database and call it a day! Once I started to make PostgreSQL data cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and simply focus on jumping straight into modeling my data.</p><p>To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our <a href="https://slack.timescale.com/">Slack channel</a>, where you can ask the Timescale community specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!</p><p>Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away, which is awesome! The easiest way to get started is by signing up for <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">a free 30-day trial of Timescale</a> (if you prefer self-hosting, you can always <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage TimescaleDB on your own PostgreSQL instances</a>). Learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.</p><p>Now, before we dip into things and get our data, as Outkast best put it, “So fresh, So clean,” I want to quickly cover the data set I'll be using. In addition, I want to note that all the code I show will assume you have some basic knowledge of SQL. If you aren't familiar with SQL, don’t worry!
In my last post, I included a section on <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#sql-basics" rel="noreferrer">SQL basics</a>.</p><h2 id="about-the-sample-dataset">About the Sample Dataset</h2><p>In my experience within the data science realm, I have done most of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose is dependent on the initial state of your data and how easy it is to evaluate. For the data set I'll use today, I'd likely do some initial cleaning before evaluation and then clean again after. Let me show you why. </p><p>I got the following <a href="https://www.kaggle.com/jaganadhg/house-hold-energy-data">IoT data set from Kaggle</a>, where a very generous individual shared their energy consumption readings from their apartment in San Jose, CA, the data incrementing every 15 minutes. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Tables.jpg" class="kg-image" alt="Graphic showing the setup of the table. The tables name is 'energy_usage_staging'. each row contains the tables column and data types, the pairs of info are as follows ([type, text], [date, date], [start_time, time], [end_time, time], [usage, float4], [units,text], [cost, text], [notes, text])" loading="lazy" width="960" height="764" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Tables.jpg 600w, https://timescale.ghost.io/blog/content/images/2022/01/Tables.jpg 960w" sizes="(min-width: 720px) 720px"></figure><p>and appears like this…</p><table>
<thead>
<tr>
<th>type</th>
<th>date</th>
<th>start_time</th>
<th>end_time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:00:00</td>
<td>00:14:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:15:00</td>
<td>00:29:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:30:00</td>
<td>00:44:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>00:45:00</td>
<td>00:59:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:00:00</td>
<td>01:14:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:15:00</td>
<td>01:29:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:30:00</td>
<td>01:44:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22</td>
<td>01:45:00</td>
<td>01:59:00</td>
<td>0.01</td>
<td>kWh</td>
<td>$0.00</td>
<td></td>
</tr>
</tbody>
</table>
<p><br>In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:</p><ul><li>The cost column is stored as a text data type, which will cause some issues.</li><li>The date and time columns are split apart, which could cause problems if I want to create plots over time or perform any type of modeling based on time.</li><li>I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household).</li></ul><p>In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!</p><h2 id="postgresql-data-cleaning">PostgreSQL Data Cleaning</h2><p>Here's what I've done in the past while working in data science. While these examples aren't exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.</p><p>Please feel free to explore these various techniques and skip around if you need! 
There's a lot here, and I designed it to be a helpful glossary of tools that you could use as needed.</p><p>The techniques I will cover include:</p><ul><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#correcting-structural-issues">Correcting structural issues</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#creating-or-generating-relevant-data">Creating or generating relevant data</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#adding-data-to-a-hypertable">Adding data to a hypertable</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#renaming-values">Renaming columns or tables</a></li><li><a href="https://timescale.ghost.io/blog/blog/postgresql-vs-python-for-data-cleaning-a-guide/#fill-in-missing-data">Fill in missing values</a></li></ul><h3 id="note-on-cleaning-approach">Note on cleaning approach:</h3><p>There are many ways that I could approach PostgreSQL data cleaning. I could create a table then <a href="https://www.postgresql.org/docs/current/sql-altertable.html"><code>ALTER</code></a> it as I clean, I could create multiple tables as I add or change data, or I could work with <a href="https://www.postgresql.org/docs/14/sql-createview.html"><code>VIEW</code></a>s. Depending on the size of my data, any of these approaches <em>could</em> make sense. However, they will have different computational consequences.</p><p>You may have noticed above that my raw data table was called <code>energy_usage_staging</code>. This is because I decided that given the state of my raw data, it's better to place the raw data in a <em>staging table</em>, clean it using <code>VIEW</code>s, and then insert it into a more usable table as part of my cleaning process. 
</p><p>This move from the raw table to the usable table could happen even before the evaluation step of analysis. Sometimes data cleaning has to occur both before AND after evaluating your data. Regardless, this data needs to be cleaned, and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL <code>VIEW</code>s, something I will talk about later.</p><p>Generally, if you're dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I'll show how to build up clean data using <code>VIEW</code>s along with additional tables. This method of cleaning avoids repeatedly rewriting large tables and is therefore more efficient. It also sets up our next blog post about data transformation, which includes the use of scripts in PostgreSQL.</p><h3 id="correcting-structural-issues">Correcting structural issues</h3><p>Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that we have the <code>date</code> and time columns separated and <code>cost</code> is recorded as a text data type. I need to convert my separated date and time columns to a timestamp and the <code>cost</code> column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.</p><h3 id="timescaledb-hypertables-and-why-timestamp-is-important">TimescaleDB hypertables and why timestamp is important</h3><p>For those of you not familiar with the structure of <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/hypertables-and-chunks/">TimescaleDB hypertables</a>, they're the basis of how we efficiently query and manipulate time-series data. Timescale hypertables are partitioned based on time, more specifically by the time column you specify upon creation of the table. </p><p>The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some <em>chunk</em> based on a time range. 
We then use these time chunks to help query the rows so that you can get more efficient querying and data manipulation based on time. This image represents the difference between a normal table and our special hypertables.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Cleaning-data.jpg" class="kg-image" alt="Graphic showing a normal table vs a hypertable. The normal table just shows data in a table. The hypertable shows data in the table, but it also shows the data being &quot;grouped&quot; or &quot;chunked&quot; by day. By adding an index like structure based on time, queries can be more efficient. " loading="lazy" width="1600" height="1219" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2022/01/Cleaning-data.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2022/01/Cleaning-data.jpg 1000w, https://timescale.ghost.io/blog/content/images/2022/01/Cleaning-data.jpg 1600w" sizes="(min-width: 720px) 720px"></figure><h4 id="changing-date-time-structure">Changing date-time structure</h4><p>Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time-based queries, I want to restructure the <code>energy_usage_staging</code> table's <code>date</code> and time columns. I could use the <code>date</code> column for my hypertable partitioning, however, I would have limited control over manipulating my data based on time. It is more flexible and space-efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!  </p><p>Looking back at the table structure, I should be able to get a usable timestamp value from the <code>date</code> and <code>start_time</code> columns as the <code>end_time</code> really doesn’t give me that much useful information. 
Thus, I want to essentially combine these two columns to form a new timestamp column. Let’s see how I can do that using SQL. Spoiler alert: it's as simple as an algebraic statement. How cool is that?!</p><p><strong>PostgreSQL code:</strong></p><p>For my PostgreSQL data cleaning code, I can create the column without inserting it into the database. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables just yet.</p><p>Let’s first compare the original columns with our newly generated column. For this query, I simply <em>add</em> the two columns together. The <code>AS</code> keyword allows me to rename the column to whatever I would like, in this case, <code>time</code>.</p><p><a name="add"></a></p>
<pre><code class="language-sql">--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time 
FROM energy_usage_staging eus;
</code></pre>
<p> Results:</p><table>
<thead>
<tr>
<th>date</th>
<th>start_time</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-10-22</td>
<td>00:00:00</td>
<td>2016-10-22 00:00:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>00:15:00</td>
<td>2016-10-22 00:15:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>00:30:00</td>
<td>2016-10-22 00:30:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>00:45:00</td>
<td>2016-10-22 00:45:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>01:00:00</td>
<td>2016-10-22 01:00:00.000</td>
</tr>
<tr>
<td>2016-10-22</td>
<td>01:15:00</td>
<td>2016-10-22 01:15:00.000</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><p>In Python, the easiest way to do this is to add a new column to the dataframe. Notice that I would have to concatenate the two columns with a space in between, then convert the result to datetime.</p><pre><code class="language-python"># concatenate date and start_time as strings, then parse the result to datetime
energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])
</code></pre>
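As an aside, if you're working in plain Python without pandas, the standard library's <code>datetime.combine()</code> mirrors the SQL <code>(date + start_time)</code> addition directly. A minimal sketch (the variable names here are illustrative, not from the script above):

```python
from datetime import date, datetime, time

# Mirror SQL's (date + start_time): merge a calendar date and a clock time
# into a single timestamp, just like the new "time" column above.
reading_date = date(2016, 10, 22)
start_time = time(0, 15)

timestamp = datetime.combine(reading_date, start_time)
print(timestamp)  # 2016-10-22 00:15:00
```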
<h3 id="changing-column-data-types">Changing column data types</h3><p>Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the <a href="https://www.postgresql.org/docs/14/functions-formatting.html"><code>TO_NUMBER()</code></a> function. </p><p>The format of the function is as follows: <code>TO_NUMBER(‘text’, ‘format’)</code>. The ‘format’ input is a PostgreSQL-specific string that you can build depending on what type of text you want to convert. In our case, we have a <code>$</code> symbol followed by a numeric value such as <code>0.65</code>. For the format string, I decided to use ‘L9G999D99’: the L lets PostgreSQL know there is a money symbol at the beginning of the text, the 9s let the system know to expect numeric digits, the G allows for a thousands separator, and the D stands for a decimal point. </p><p>This format covers values up to ‘$9,999.99’, which is more than enough because the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding more digits and G separators. For example, say you have a cost column with text values like ‘$1,672,278.23’; then you would want to format the string like this: ‘L9G999G999D99’.</p><p><strong>PostgreSQL code:</strong></p><p><a name="tonumber"></a></p>
<pre><code class="language-sql">--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus  
ORDER BY cost_new DESC
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>cost</th>
<th>cost_new</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0.65</td>
<td>0.65</td>
</tr>
<tr>
<td>$0.65</td>
<td>0.65</td>
</tr>
<tr>
<td>$0.65</td>
<td>0.65</td>
</tr>
<tr>
<td>$0.57</td>
<td>0.57</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
<tr>
<td>$0.46</td>
<td>0.46</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><p>For Python, I used a lambda function that systematically replaces all the ‘$’ signs with empty strings. Because <code>apply()</code> invokes the Python function once per row, this can be fairly inefficient on large data sets.</p><pre><code class="language-python"># strip the '$' from each cost string, then convert the column to a numeric dtype
energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])
</code></pre>
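Note that the lambda above only strips the ‘$’ sign. If your cost strings ever contain thousands separators (the case the G handles on the SQL side), the Python version needs to drop those too. A minimal stdlib-only sketch, with <code>parse_currency</code> as a hypothetical helper rather than part of the original script:

```python
def parse_currency(text: str) -> float:
    """Drop a leading money symbol and any thousands separators, then convert to float.

    A rough Python-side analogue of PostgreSQL's TO_NUMBER(text, 'L9G999G999D99').
    """
    return float(text.lstrip('$').replace(',', ''))

print(parse_currency('$0.65'))          # 0.65
print(parse_currency('$1,672,278.23'))  # 1672278.23
```

For large dataframes, a vectorized <code>str.replace</code> on the column would be faster than applying a helper like this row by row.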
<h3 id="creating-a-view">Creating a <code>VIEW</code></h3><p>Now that I know how to convert my columns, I can combine the two queries and create a <code>VIEW</code> of my new restructured table. A <a href="https://www.postgresql.org/docs/14/sql-createview.html"><code>VIEW</code></a> is a PostgreSQL object that allows you to define a query and then call it by name as if it were a table within your database. I can use the following query to generate the data I want and then create a <code>VIEW</code> that I can query like a table.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">-- query the right data that I want
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 02:00:00.000</td>
<td>0.02</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 02:15:00.000</td>
<td>0.02</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
</tr>
</tbody>
</table>
<p>I decided to call my <code>VIEW</code> <code>energy_view</code>. Now, when I want to do further cleaning, I can just specify its name in the <code>FROM</code> statement.</p><p><a name="view"></a></p>
<pre><code class="language-sql">--create view from the query above
CREATE VIEW energy_view AS
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging
</code></pre>
<p><strong>Python code:</strong></p><pre><code class="language-python"># select and copy the columns so the rename below doesn't trigger SettingWithCopyWarning
energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']].copy()
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))
</code></pre>
<p>It is important to note that with PostgreSQL <code>VIEW</code>s, the data inside them has to be recalculated every time you query them. This is why we want to insert our <code>VIEW</code> data into a hypertable once we have the data set up just right. You can think of <code>VIEW</code>s as a shorthand version of the <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#cte">CTEs <code>WITH</code> <code>AS</code></a> statement I discussed in my last post.</p><p>We are now one step closer to cleaner data!</p><h3 id="creating-or-generating-relevant-data">Creating or generating relevant data</h3><p>With some quick investigation, we can see that the notes column is blank for this data set. To check this, I just need to include a <code>WHERE</code> clause and specify where <code>notes</code> are not equal to an empty string. </p><p><strong>PostgreSQL data cleaning code—detecting blank notes:</strong></p><p><a name="where"></a></p>
<pre><code class="language-sql">SELECT * 
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';
</code></pre>
<p>The results come out empty.</p><p><strong>Python code:</strong></p><pre><code class="language-python">print(energy_df[energy_df['notes'].notnull()])
</code></pre>
<p>Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later on during modeling. One thing I would like to add in particular is a column that specifies the day of the week. To do this, I can use the <code>EXTRACT()</code> command. The <a href="https://www.postgresql.org/docs/14/functions-datetime.html"><code>EXTRACT()</code></a> command is a PostgreSQL date/time function that allows you to extract various date/time elements. For our column, PostgreSQL has the specification DOW (day of week), which maps Sunday to 0 through Saturday to 6.</p><p><strong>PostgreSQL code:</strong></p><p><a name="extract"></a></p>
<pre><code class="language-sql">--extract day-of-week from date column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>notes</th>
<th>day_of_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td></td>
<td>6</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><pre><code class="language-python">energy_df['day_of_week'] = energy_df['time'].dt.dayofweek
</code></pre>
<p>Additionally, we may want to add another column that specifies if a day occurs over a weekend or weekday. To do this, I create a boolean column, where <code>true</code> represents a weekend, and <code>false</code> represents a weekday. I then apply a <a href="https://www.postgresql.org/docs/14/plpgsql-control-structures.html"><code>CASE</code></a> statement. With this command, I can specify “when-then” statements (similar to “if-then” statements in coding) where I can say <code>WHEN</code> a <code>day_of_week</code> value is <code>IN</code> the set (0,6) <code>THEN</code> the output should be <code>true</code>, <code>ELSE</code> the value should be <code>false</code>.</p><p><strong>PostgreSQL code:</strong></p><p><a name="case"></a></p>
<pre><code class="language-sql">SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
	ELSE false
END AS is_weekend
FROM energy_view ew
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week</th>
<th>is_weekend</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
</tr>
</tbody>
</table>
<p>Fun fact: you can do the same query without a <code>CASE</code> statement; however, this shorthand only works for boolean columns.</p><pre><code class="language-sql">--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew
</code></pre>
<p><strong>Python code:</strong></p><p>Notice that in Python, the weekends are represented by numbers 5 and 6 vs. the PostgreSQL weekend values 0 and 6.</p><pre><code class="language-python">energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))
</code></pre>
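<p>This convention difference is easy to double-check in a quick sketch (the timestamps below are taken from the sample results above; pandas numbers Monday as 0 through Sunday as 6):</p>

```python
import pandas as pd

# 2016-10-22 is a Saturday and 2016-10-23 a Sunday (see the results above)
ts = pd.to_datetime(["2016-10-22", "2016-10-23"])
print(ts.dayofweek.tolist())  # pandas convention: [5, 6]
```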
<p>And maybe things then start getting really crazy. Maybe you want to add more parameters! </p><p>Let’s consider holidays. Now, you may be asking, “Why in the world would we do that?!” but often, people have time off during some of the holidays within the US. Since this individual lives within the US, they likely have at least <em>some </em>of the holidays off. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I create another boolean column identifying when a federal holiday occurs. </p><p>I can accomplish this using TimescaleDB’s <code>time_bucket()</code> function. The <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket()</code></a> function is one of the functions I discussed in detail in my <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#timebucket">previous post</a>. Essentially, I need to use this function to make sure all time values within a single day get accounted for. Without using the <code>time_bucket()</code> function, I would only see changes to the row associated with the 12 a.m. time period. </p><p><strong>PostgreSQL code:</strong></p><p>After I create a holiday table, I can then use the data from it within my query. I also decided to use the non-case syntax for this query. Note that you can use either!</p><p><a name="timebucket"></a></p>
<pre><code class="language-sql">--create table for the holidays
CREATE TABLE holidays (
date date);

--insert the holidays into table
INSERT INTO holidays 
VALUES ('2016-11-11'), 
('2016-11-24'), 
('2016-12-24'), 
('2016-12-25'), 
('2016-12-26'), 
('2017-01-01'),  
('2017-01-02'), 
('2017-01-16'), 
('2017-02-20'), 
('2017-05-29'), 
('2017-07-04'), 
('2017-09-04'), 
('2017-10-9'), 
('2017-11-10'), 
('2017-11-23'), 
('2017-11-24'), 
('2017-12-24'), 
('2017-12-25'), 
('2018-01-01'), 
('2018-01-15'), 
('2018-02-19'), 
('2018-05-28'), 
('2018-07-4'), 
('2018-09-03'), 
('2018-10-8');

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week</th>
<th>is_weekend</th>
<th>is_holiday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 01:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
<td>6</td>
<td>true</td>
<td>false</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><pre><code class="language-python">holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01',  '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-9', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-4', '2018-09-03', '2018-10-8']
# compare each reading's calendar date (not the day_of_week integer) against the holiday list
energy_df['is_holiday'] = np.where(pd.to_datetime(energy_df['time']).dt.normalize().isin(pd.to_datetime(holidays)), 1, 0)
print(energy_df.head(20))
</code></pre>
<p>At this point, I’m going to save this expanded table into another <code>VIEW</code> so that I can call the data without writing out the query.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew
</code></pre>
<p>You may ask, “Why did you create these as boolean columns?!” A very fair question! You see, I may want to use these columns for filtering during analysis, something I commonly do during my own analysis process. In PostgreSQL, you can use boolean columns to filter things super easily. For example, say that I want to show only the data that occurs over a weekend <code>AND</code> a holiday. I can do this simply by adding a <code>WHERE</code> statement along with the specified columns.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">--if you use boolean columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week</th>
<th>is_weekend</th>
<th>is_holiday</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:00:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:15:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:30:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 00:45:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 01:00:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-12-24 01:15:00.000</td>
<td>0.34</td>
<td>kWh</td>
<td>0.06</td>
<td>6</td>
<td>true</td>
<td>true</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><pre><code class="language-python">print(energy_df[(energy_df['is_weekend']==1) &amp; (energy_df['is_holiday']==1)].head(10))
</code></pre>
<h3 id="adding-data-to-a-hypertable">Adding data to a hypertable</h3><p>Now I have new columns ready to go, and I know how I would like my table to be structured, so I can create a new hypertable and insert my cleaned data. In my own analysis of this data set, I may have run cleaning up to this point BEFORE evaluating my data so that the evaluation step is more meaningful. What’s great is that you can use any of these techniques for general cleaning, whether that happens before or after evaluation.</p><p><strong>PostgreSQL:</strong></p><p><a name="create"></a><br>
<a name="createhyper"></a><br>
<a name="insert"></a></p>
<pre><code class="language-sql">CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool
);

--command to create a hypertable
SELECT create_hypertable('energy_usage', 'time');

INSERT INTO energy_usage 
SELECT *
FROM energy_view_exp
</code></pre>
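<p>For comparison, pushing a cleaned DataFrame back into a database from Python typically goes through pandas’ <code>to_sql()</code>. A minimal, self-contained sketch, using an in-memory SQLite database as a stand-in for the Postgres connection (the two-row frame here is hypothetical):</p>

```python
import sqlite3
import pandas as pd

# hypothetical cleaned sample mirroring a few energy_usage columns
energy_df = pd.DataFrame({
    'type': ['Electric usage', 'Electric usage'],
    'usage': [0.01, 0.34],
    'is_weekend': [True, True],
})

conn = sqlite3.connect(':memory:')
# creates the table if needed and appends the rows
energy_df.to_sql('energy_usage', conn, index=False, if_exists='append')
print(conn.execute('SELECT COUNT(*) FROM energy_usage').fetchone()[0])  # 2
```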
<p>Note that if you had data continually coming in, you could create a script within your database that automatically applies these changes when importing your data. That way, you can have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis. </p><p>I'll discuss this in detail in my next post. Stay tuned if you want to know how to create scripts and keep data automatically updated!</p><h3 id="renaming-values">Renaming values</h3><p>Another valuable data cleaning technique is renaming items or remapping categorical values. The importance of this skill is underscored by the <a href="https://stackoverflow.com/questions/40427943/how-do-i-change-a-single-index-value-in-pandas-dataframe">popularity of this Python data analysis question on StackOverflow</a>, which asks, “How do I change a single index value in a pandas dataframe?” Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values is fairly simple.</p><p>When renaming specific values within a table, you can do this “on the fly” by using PostgreSQL’s <code>CASE</code> statement within the <code>SELECT</code> query. Let’s say I don’t like Sunday being represented by a 0 in the <code>day_of_week</code> column and would prefer it to be a 7. I can do this with the following query.</p><p><strong>PostgreSQL code:</strong></p><pre><code class="language-sql">SELECT type, time, usage, cost, is_weekend,
-- you can use case to recode column values 
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week 
END AS day_of_week
FROM energy_usage
</code></pre>
<p><strong>Python code:</strong></p><p>Caveat: this code would make Monday = 7 because pandas numbers the days of the week with Monday as 0 and Sunday as 6. You likely would not want to do this exact recoding; I just wanted to show the Python equivalent of updating one value within a column for reference. Note the use of <code>.loc</code>, which avoids pandas’ chained-assignment warning.</p><pre><code class="language-python">energy_df.loc[energy_df['day_of_week'] == 0, 'day_of_week'] = 7
print(energy_df.head(250))
</code></pre>
<p>Now, let’s say that I wanted to use the names of the days of the week instead of showing numeric values. For this example, I want to ditch the <code>CASE</code> statement and create a mapping table. When you need to change various values, it will likely be more efficient to create a mapping table and then join it to this table using the <a href="https://www.postgresql.org/docs/14/queries-table-expressions.html"><code>JOIN</code></a> command.</p><p><strong>PostgreSQL:</strong></p><p><a name="join"></a></p>
<pre><code class="language-sql">--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
);

--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday');

--then I can join this table to my cleaning table to remap the days of the week
SELECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>type</th>
<th>time</th>
<th>usage</th>
<th>units</th>
<th>cost</th>
<th>day_of_week_name</th>
<th>is_weekend</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:45:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:30:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:15:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-07-22 00:00:00.000</td>
<td>0.1</td>
<td>kWh</td>
<td>0.03</td>
<td>Sunday</td>
<td>true</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2018-02-11 23:00:00.000</td>
<td>0.04</td>
<td>kWh</td>
<td>0.01</td>
<td>Sunday</td>
<td>true</td>
</tr>
</tbody>
</table>
<p><strong>Python:</strong></p><p>In this case, python has similar mapping functions.</p><pre><code class="language-python">energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})
print(energy_df.head(20))
</code></pre>
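<p>For a closer analogue to the SQL <code>LEFT JOIN</code> above, pandas can also merge against a small mapping frame (the three-row <code>energy_df</code> here is hypothetical):</p>

```python
import pandas as pd

# mapping frame mirroring the day_of_week_mapping table
dow_map = pd.DataFrame({
    'day_of_week': range(7),
    'day_of_week_name': ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                         'Thursday', 'Friday', 'Saturday'],
})

energy_df = pd.DataFrame({'day_of_week': [6, 0, 3]})  # hypothetical rows

# how='left' keeps every row of energy_df, just like the SQL LEFT JOIN
merged = energy_df.merge(dow_map, on='day_of_week', how='left')
print(merged['day_of_week_name'].tolist())  # ['Saturday', 'Sunday', 'Wednesday']
```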
<p>Hopefully, one of these techniques will be useful for you as you approach data renaming!</p><p>Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as <code>AS</code> (I couldn’t not use such a ridiculous statement 😂). When you use the <code>SELECT</code> statement, you can rename your columns like so,</p><p><strong>PostgreSQL code:</strong></p><p><a name="as"></a></p>
<pre><code class="language-sql">SELECT type AS usage_type,
time AS time_stamp,
usage,
units, 
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>usage_type</th>
<th>time_stamp</th>
<th>usage</th>
<th>units</th>
<th>dollar_amount</th>
</tr>
</thead>
<tbody>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:00:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:15:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:30:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
<tr>
<td>Electric usage</td>
<td>2016-10-22 00:45:00.000</td>
<td>0.01</td>
<td>kWh</td>
<td>0.00</td>
</tr>
</tbody>
</table>
<p><strong>Python code:</strong></p><p>Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster, but also just more elegant in its code.</p><pre><code class="language-python">energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))
</code></pre>
<h3 id="fill-in-missing-data">Fill in missing data</h3><p>Another common problem in the PostgreSQL data cleaning process is having missing data. For the dataset we are using, there are no obviously missing data points. However, it's very possible that with evaluation, we could find missing hourly data from a power outage or some other phenomenon. </p><p>This is where the gap-filling functions TimescaleDB offers could come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes, you can navigate this problem by filling in missing data with reasonable estimates and TimescaleDB actually has built-in functions to help you do this. </p><p>For example, let’s say that you are modeling energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values. </p><p>As an example, I created some data. I called this table <code>energy_data</code> and it is missing both <code>time</code> and <code>energy</code> readings for the timestamps between 7:45 a.m. and 11:30 a.m.</p><table>
<thead>
<tr>
<th>time</th>
<th>energy</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021-01-01 07:00:00.000</td>
<td>0</td>
</tr>
<tr>
<td>2021-01-01 07:15:00.000</td>
<td>0.1</td>
</tr>
<tr>
<td>2021-01-01 07:30:00.000</td>
<td>0.1</td>
</tr>
<tr>
<td>2021-01-01 07:45:00.000</td>
<td>0.2</td>
</tr>
<tr>
<td>2021-01-01 11:30:00.000</td>
<td>0.04</td>
</tr>
<tr>
<td>2021-01-01 11:45:00.000</td>
<td>0.04</td>
</tr>
<tr>
<td>2021-01-01 12:00:00.000</td>
<td>0.03</td>
</tr>
<tr>
<td>2021-01-01 12:15:00.000</td>
<td>0.02</td>
</tr>
<tr>
<td>2021-01-01 12:30:00.000</td>
<td>0.03</td>
</tr>
<tr>
<td>2021-01-01 12:45:00.000</td>
<td>0.02</td>
</tr>
<tr>
<td>2021-01-01 13:00:00.000</td>
<td>0.03</td>
</tr>
</tbody>
</table>
<p>I can use TimescaleDB’s <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/">gapfilling hyperfunctions</a> to fill in these missing values. The <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/interpolate/"><code>interpolate()</code></a> function is another one of TimescaleDB’s hyperfunctions. It creates data points that follow a linear approximation given the data points before and after the missing range of data. Alternatively, you could use the <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/locf/"><code>locf()</code></a> hyperfunction, which carries the last recorded value forward to fill in the gap (locf stands for “last observation carried forward”). Both of these functions must be used in conjunction with the <a href="https://docs.timescale.com/api/latest/hyperfunctions/gapfilling-interpolation/time_bucket_gapfill/"><code>time_bucket_gapfill()</code></a> function. </p><p><strong>PostgreSQL code:</strong></p><p><a name="gapfill"></a></p>
<pre><code class="language-sql">SELECT
--here I specified 30-minute buckets, which matches the results shown below
  time_bucket_gapfill('30 min', time) AS timestamp,
  interpolate(avg(energy)),
  locf(avg(energy))
FROM energy_data
--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement
WHERE energy IS NOT NULL AND time &gt; '2021-01-01 07:00:00.000' AND time &lt; '2021-01-01 13:00:00.000'
GROUP BY timestamp
ORDER BY timestamp;
</code></pre>
<p>Results:</p><table>
<thead>
<tr>
<th>timestamp</th>
<th>interpolate</th>
<th>locf</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021-01-01 07:00:00.000</td>
<td>0.1</td>
<td>0.10000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 07:30:00.000</td>
<td>0.15</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 08:00:00.000</td>
<td>0.13625</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 08:30:00.000</td>
<td>0.1225</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 09:00:00.000</td>
<td>0.10875</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 09:30:00.000</td>
<td>0.095</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 10:00:00.000</td>
<td>0.08125</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 10:30:00.000</td>
<td>0.0675</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 11:00:00.000</td>
<td>0.05375</td>
<td>0.15000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 11:30:00.000</td>
<td>0.04</td>
<td>0.04000000000000000000</td>
</tr>
<tr>
<td>2021-01-01 12:00:00.000</td>
<td>0.025</td>
<td>0.02500000000000000000</td>
</tr>
<tr>
<td>2021-01-01 12:30:00.000</td>
<td>0.025</td>
<td>0.02500000000000000000</td>
</tr>
</tbody>
</table>
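<p>The <code>interpolate</code> column is plain linear interpolation between the last bucket before the gap and the first bucket after it, so the values above can be checked by hand. Here the 07:30 bucket averages 0.15 and the 11:30 bucket averages 0.04, eight 30-minute steps apart:</p>

```python
import numpy as np

# bucket averages on each side of the gap (taken from the results table)
start, end = 0.15, 0.04
steps = 8  # number of 30-minute buckets from 07:30 to 11:30

# evenly spaced values from start to end, inclusive
values = np.linspace(start, end, steps + 1)
print(round(values[1], 5), round(values[2], 5))  # 08:00 and 08:30 buckets: 0.13625 0.1225
```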
<p><strong>Python code:</strong></p><pre><code class="language-python">energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').ffill().reset_index()
energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)
</code></pre>
<p><strong>Bonus:</strong></p><p>The following queries show how I could simply ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a <code>WHERE</code> clause to specify the times I’d like to ignore (the second query).</p><pre><code class="language-sql">SELECT * 
FROM energy_data 
WHERE energy IS NOT NULL;

SELECT * 
FROM energy_data
WHERE time &lt;= '2021-01-01 07:45:00.000' OR time &gt;= '2021-01-01 11:30:00.000'
</code></pre>
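<p>The pandas equivalents of these two filters, for reference (using a hypothetical three-row frame with one missing reading):</p>

```python
import pandas as pd

energy_data = pd.DataFrame({
    'time': pd.to_datetime(['2021-01-01 07:00', '2021-01-01 08:00', '2021-01-01 11:30']),
    'energy': [0.1, None, 0.04],
})

# drop rows where energy is null (the IS NOT NULL filter)
non_null = energy_data[energy_data['energy'].notna()]
print(len(non_null))  # 2

# or filter on the time range instead (the WHERE time <= ... OR time >= ... filter)
outside_gap = energy_data[(energy_data['time'] <= '2021-01-01 07:45') |
                          (energy_data['time'] >= '2021-01-01 11:30')]
print(len(outside_gap))  # 2
```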
<h2 id="postgresql-data-cleaning-wrap-up">PostgreSQL Data Cleaning Wrap-Up</h2><p>After reading through these various techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL data cleaning and TimescaleDB data cleaning provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you're looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB. </p><p>In my next posts, I'll discuss techniques for transforming data using PostgreSQL and TimescaleDB. I'll then use everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).</p><p>If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our <a href="https://slack.timescale.com/">community Slack</a>, where you'll find an active community of time-series enthusiasts and various Timescale team members.</p><p>If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can <a href="https://console.cloud.timescale.com/signup" rel="noreferrer">sign up for a free 30-day trial</a> or install TimescaleDB and <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">manage it on your current PostgreSQL instances</a>. We also have a bunch of <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">great tutorials</a> to help get you started.</p><p>Until next time!</p><p><strong>Functionality Glossary:</strong></p>
<ul>
<li><a href="#add">Adding columns together</a></li>
<li><a href="#tonumber"><code>TO_NUMBER()</code></a></li>
<li><a href="#view"><code>VIEW</code></a></li>
<li><a href="#where"><code>WHERE</code></a></li>
<li><a href="#extract"><code>EXTRACT()</code></a></li>
<li><a href="#case"><code>CASE</code></a></li>
<li><a href="#timebucket"><code>time_bucket()</code></a></li>
<li><a href="#join"><code>JOIN</code></a></li>
<li><a href="#as"><code>AS</code></a></li>
<li><a href="#create"><code>CREATE TABLE</code></a></li>
<li><a href="#createhyper"><code>create_hypertable()</code></a></li>
<li><a href="#insert"><code>INSERT INTO</code></a></li>
<li><a href="#gapfill"><code>time_bucket_gapfill()</code></a></li>
</ul>
]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL vs Python for Data Evaluation: What, Why, and How]]></title>
            <description><![CDATA[Get a primer on using TimescaleDB and PostgreSQL to more efficiently perform your data evaluation tasks - previously done in Excel, R, or Python. Complete with short SQL refresher section, along with 1-to-1 code snippets comparing TimescaleDB and PostgreSQL code against Python code.]]></description>
            <link>https://www.tigerdata.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient</guid>
            <category><![CDATA[General]]></category>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Benchmarks & Comparisons]]></category>
            <dc:creator><![CDATA[Miranda Auhl]]></dc:creator>
            <pubDate>Fri, 01 Oct 2021 13:35:34 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/09/gabriel-crismariu-sOK9NjLArCw-unsplash.jpg">
            </media:content>
            <content:encoded><![CDATA[<h2 id="introduction">Introduction</h2><p>As I started writing this post, I realized that to properly show how to evaluate, clean, and transform data in the database (also known as data munging), I needed to focus on each step individually. This blog post will show you exactly how to use TimescaleDB and PostgreSQL to perform your <strong>data evaluation tasks</strong> that you may have previously done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/evaluation tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  </p><p>You may be asking yourself, “What exactly do you mean by <em>Evaluating</em> the data?”. When I talk about evaluating the data, I mean <em>really</em> understanding the data set you are working with. </p><p>If - in a theoretical world - I could grab a beer with my data set and talk to it about everything, that is what I would do during the evaluating step of my data analysis process. Before beginning analysis, I want to know every column, every general trend, every connection between tables, etc. 
To do this, I have to sit down and run query after query to get a solid picture of my data.</p><h3 id="recap">Recap</h3><p>If you remember, <a href="https://timescale.ghost.io/blog/blog/speeding-up-data-analysis/">in my last post</a>, I summarized the analysis process as the “data analysis lifecycle” with the following steps: Evaluate, Clean, Transform, and Model.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/data-analysis-lifecycle-1.jpeg" class="kg-image" alt=" Image showing Evaluate -> Clean -> Transform -> Model, accompanied by icons which relate to each step" loading="lazy" width="1838" height="466" srcset="https://blog.timescale.com/content/images/size/w600/2021/09/data-analysis-lifecycle-1.jpeg 600w, https://blog.timescale.com/content/images/size/w1000/2021/09/data-analysis-lifecycle-1.jpeg 1000w, https://blog.timescale.com/content/images/size/w1600/2021/09/data-analysis-lifecycle-1.jpeg 1600w, https://blog.timescale.com/content/images/2021/09/data-analysis-lifecycle-1.jpeg 1838w" sizes="(min-width: 720px) 720px"><figcaption>Data Analysis Lifecycle</figcaption></figure><p>As a data analyst, I found that all the tasks I performed could be grouped into these four categories, with evaluating the data as the first and I feel the most crucial step in the process. </p><p>My first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of usage - such as electricity, water, sewage, you name it - to figure out how buildings could be more efficient. They would place sensors on whatever medium you wanted to monitor to help you figure out what initiatives your group could take to be more sustainable and ultimately save costs. 
My role at this company was to perform data analysis and business intelligence tasks.</p><p>Throughout my time in this job, I got the chance to use many popular tools to evaluate my data, including Excel, R, Python, and heck, even Minitab. But once I tried using a database - and specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward evaluating work could be when done directly in a database. Lines of code that took me a while to hunt down online, trying to figure out how to accomplish with pandas, could be done intuitively through SQL. Plus, the database queries were just as fast, if not faster, than my other code most of the time. </p><p>Now, while I would love to show you a one-to-one comparison of my SQL code against each of these popular tools, that’s not practical. Besides, no one wants to read three examples of the same thing in a row! Thus, for comparison purposes in this blog post, I will directly show TimescaleDB and PostgreSQL functionality against Python code. Keep in mind that almost all code will likely be comparable to your Excel and R code. However, if you have any questions, feel free to hop on and join our <a href="https://slack.timescale.com/">Slack channel</a>, where you can ask the Timescale community, or me, specifics on TimescaleDB or PostgreSQL functionality 😊. I’d love to hear from you!</p><p>Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! If so, you can <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage TimescaleDB on your own PostgreSQL instances</a>. 
(You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p><p>But enough of an intro, let’s get into the good stuff!</p><!--kg-card-begin: html--><center><iframe src="https://giphy.com/embed/QvwCVnX9DWdlHCnix5" width="280" height="280" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/cbc-schitts-creek-QvwCVnX9DWdlHCnix5">via GIPHY</a></p></center>
<!--kg-card-end: html--><h2 id="sql-basics">SQL basics</h2><!--kg-card-begin: markdown--><p>PostgreSQL is a database platform that uses SQL syntax to interact with the data inside it. TimescaleDB is an extension that is applied to a PostgreSQL database. To unlock the potential of PostgreSQL and TimescaleDB, you have to use SQL. So, before we jump into things, I wanted to give a basic SQL syntax refresher. <a href="https://timescale.ghost.io/blog/blog/how-to-evaluate-your-data-directly-within-the-database-and-make-your-analysis-more-efficient/#a-quick-note-on-the-data">If you are familiar with SQL, please feel free to skip this section!</a></p>
<!--kg-card-end: markdown--><p>For those of you who are newer to SQL (short for structured query language), it is the language many relational databases, including PostgreSQL, use to query data. Like <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">pandas’ DataFrames</a> or Excel’s spreadsheets, data queried with SQL is structured as a table with columns and rows.</p><p>The <a href="https://www.postgresql.org/docs/current/sql-select.html">basics of a SQL <code>SELECT</code> command</a> can be broken down like this 👇</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT --columns, functions, aggregates, expressions that describe what you want to be shown in the results
FROM --if selecting data from a table in your DB, you must define the table name here
JOIN --join another table to the FROM statement table 
    ON --a column that each table shares values 
WHERE --statement to filter results where a column or expression is equivalent to some statement
GROUP BY --if SELECT or WHERE statement contains an aggregate, or if you want to group values on a column/expression, must include columns here
HAVING --similar to WHERE, this keyword helps to filter results based upon columns or expressions specifically used with a GROUP BY query
ORDER BY --allows you to specify the order in which your data is displayed
LIMIT --lets you specify the number of rows you want displayed in the output
</code></pre>
<!--kg-card-end: markdown--><p>You can think of your queries as <em>SELECTing </em>data <em>FROM </em>your tables within your database. You can <em>JOIN</em> multiple tables together and specify <em>WHERE</em> your data needs to be filtered or what it should be <em>GROUPed BY</em>. Do you see what I did there 😋?</p><p>This is the beauty of SQL; these keywords’ names were chosen to make your queries intuitive. Thankfully, most PostgreSQL and SQL functionality follows this same easy-to-read pattern. I have had the opportunity to teach myself many programming languages throughout my career, and SQL is by far the easiest to read, write, and construct. This intuitive nature is another excellent reason why data munging in PostgreSQL and TimescaleDB can be so efficient when compared to other methods. </p><p>Note that this list of keywords includes most of the ones you will need to start selecting data with SQL; however, it is not exhaustive. You will not need to use all these phrases for every query but likely will need at least <code>SELECT</code> and <code>FROM</code>. The queries in this blog post will always include these two keywords.</p><p>Additionally, the order of these keywords is specific. When building your queries, you need to follow the order that I used above. For any additional PostgreSQL commands you wish to use, you will have to research where they fit in the order hierarchy and follow that accordingly. </p><p>Seeing a list of commands may be somewhat helpful but is likely not enough to solidify understanding if you are like me. So let’s look at some examples!</p><p>Let’s say that I have a table in my PostgreSQL database called <code>energy_usage</code>. This table contains three columns: <code>time</code> which contains timestamp values, <code>energy</code> which contains numeric values, and <code>notes</code> which contains string values. 
As you may be able to imagine, every row of data in my <code>energy_usage</code> table will contain:</p><ul><li><code>time</code>: timestamp value saying when the reading was collected</li><li><code>energy</code>: numeric value representing how much energy was used since the last reading</li><li><code>notes</code>: string value giving additional context to each reading. <br></li></ul><p>If I wanted to look at all the data within the table, I could use the following SQL query:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT time, energy, notes -- I list my columns here
FROM energy_usage; -- I list my table here and end the query with a semicolon
</code></pre>
<!--kg-card-end: markdown--><p>Alternatively, SQL has a shorthand for ‘include all columns’: the operator <code>*</code>. So I could also select all the data using this query:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT *
FROM energy_usage;
</code></pre>
<!--kg-card-end: markdown--><p>What if I want to select the data and order it by the <code>time</code> column so that the earliest readings are first and the latest are last? All I need to do is include the <code>ORDER BY</code> statement and then specify the <code>time</code> column along with the specification <code>ASC</code> to let the database know I want the data in ascending order.</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT *
FROM energy_usage
ORDER BY time ASC; -- first I list the time column, then I specify either ASC or DESC
</code></pre>
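To make the keyword order concrete, here is a small runnable sketch. It uses Python's built-in <code>sqlite3</code> module with a made-up three-row <code>energy_usage</code> table purely for illustration; the SQL itself is the same shape you would run against PostgreSQL.

```python
import sqlite3

# Hypothetical in-memory stand-in for the energy_usage table described above.
# SQLite is used here only for illustration; the query text matches PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE energy_usage (time TEXT, energy REAL, notes TEXT)")
conn.executemany(
    "INSERT INTO energy_usage VALUES (?, ?, ?)",
    [
        ("2016-01-01 02:00:00", 1.2, "baseline"),
        ("2016-01-01 00:00:00", 0.8, "baseline"),
        ("2016-01-01 01:00:00", 0.5, "heater off"),
    ],
)

# SELECT ... FROM ... ORDER BY ... LIMIT, in exactly the keyword order above
rows = conn.execute(
    "SELECT time, energy FROM energy_usage ORDER BY time ASC LIMIT 2"
).fetchall()
print(rows)  # [('2016-01-01 00:00:00', 0.8), ('2016-01-01 01:00:00', 0.5)]
```

Note that the rows come back sorted by time even though they were inserted out of order; the database, not the insertion order, controls what you see.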
<!--kg-card-end: markdown--><p>Hopefully, you can start to see the pattern and feel more comfortable with SQL syntax. I will be showing a lot more code snippets throughout the post, so hang tight if you still need more examples!</p><p>So now that we have a little refresher on SQL basics, let’s jump into how you can use this language along with TimescaleDB and PostgreSQL functionality to do your data evaluating tasks!</p><!--kg-card-begin: markdown--><p><a name="datanotepart"></a></p>
<!--kg-card-end: markdown--><h2 id="a-quick-note-on-the-data">A quick note on the data</h2><p>Earlier I talked about my first job as a data analyst for an IoT sustainability company. Because of this job, I tend to love IoT data sets and couldn’t pass up the chance to explore <a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries">this IoT dataset from Kaggle</a> to show how to perform data munging tasks in PostgreSQL and TimescaleDB. </p><p>The data set contains two tables, one specifying energy consumption for a single home in Houston, Texas (called <code>power_usage</code>), and the other documenting weather conditions (called <code>weather</code>). This data is actually the same data set that I used in my previous post, so bonus points if you caught that 😊!</p><p>This data was recorded from January 2016 to December 2020. While looking at this data set, and all time-series data sets, we must consider any outside influences that could affect the data. The most obvious factor that impacts the analysis of this dataset is the COVID-19 pandemic, which affects this data from early 2020 through December 2020. Thankfully, we will see that the individual recording this data included some notes to help categorize days affected by the pandemic. As I go through this blog series, we will see patterns associated with the data collected during the COVID-19 pandemic, so definitely keep this fact in the back of your mind as we perform various data munging analysis steps!</p><p>Here is an image explaining the two tables, their column names in red and corresponding data types in blue.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/09/Tables.jpg" class="kg-image" alt="explanation of the power_usage table and weather table. The power table has four columns: startdate (timestamp), value_kWh (numeric), day_of_week (int),  notes (varchar). 
Weather has date (date), day (int), temp_max (numeric), temp_avg (numeric), temp_min (numeric), dew_max (numeric), dew_avg (numeric), dew_min (numeric), hum_max	(numeric), hum_avg	(numeric), hum_min	(numeric), wind_max (numeric), wind_avg (numeric), wind_min (numeric), press_max (numeric), press_avg (numeric), press_min (numeric), precipit (numeric), day_of_week (int)" loading="lazy" width="1480" height="1472" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/Tables.jpg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/Tables.jpg 1000w, https://timescale.ghost.io/blog/content/images/2021/09/Tables.jpg 1480w" sizes="(min-width: 720px) 720px"></figure><p>As we work through this blog post, we will use the evaluating techniques available within PostgreSQL and TimescaleDB to understand these two tables inside and out.</p><h2 id="evaluating-the-data">Evaluating the data</h2><p>As we discussed before, the first step in the data analysis lifecycle - and arguably the most critical step -  is to evaluate the data. I will go through how I would approach evaluating this IoT energy data, showing most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, they will cover many of the evaluating steps you perform during your analysis, helping to make your evaluating tasks more efficient by using PostgreSQL and TimescaleDB. </p><p>The techniques that I will cover include:</p><!--kg-card-begin: markdown--><ul>
<li><a href="#technique1">Reading the raw data</a></li>
<li><a href="#technique2">Finding and observing “categorical” column values in my dataset</a></li>
<li><a href="#technique3">Sorting my data by specific columns</a></li>
<li><a href="#technique4">Displaying grouped data</a></li>
<li><a href="#technique5">Finding abnormalities in the database</a></li>
<li><a href="#technique6">Looking at general trends</a></li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique1"></a></p>
<!--kg-card-end: markdown--><h3 id="reading-the-raw-data">Reading the raw data</h3><p>Let’s start with the simplest evaluating task: looking at the raw data.</p><p>As we learned in the SQL refresher above, we can quickly pull all the data within a table by using the <code>SELECT</code> statement with the <code>*</code> operator. Since I have two tables within my database, I will query both tables’ information by running a query for each.</p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="select"></a></p>
<pre><code class="language-sql">-- select all the data from my power_usage table
SELECT * 
FROM power_usage pu; 
-- selects all the data from my weather table
SELECT * 
FROM weather w;
</code></pre>
<!--kg-card-end: markdown--><p>But what if I don’t necessarily need to query all my data? Since all the data is housed in the database, if I want to get a feel for the data and the column values, I could just look at a snapshot of the raw data. </p><p>While conducting analysis in Python, I often would just print a handful of rows of data to get a feel for the values. We can do this in PostgreSQL by including the <code>LIMIT</code> command within our query. To show the first 20 rows of data in my tables, I can do the following:</p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="limit"></a></p>
<pre><code class="language-sql">-- select all the data from my power_usage table
SELECT * 
FROM power_usage pu
LIMIT 20; -- specify 20 because I only want to see 20 rows of data
-- selects all the data from my weather table
SELECT * 
FROM weather w 
LIMIT 20;
</code></pre>
<!--kg-card-end: markdown--><p>Results: Some of the rows for each table</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>startdate</th>
<th>value_kwh</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06 01:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 02:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 03:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 04:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 05:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 06:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><div style="width:100%;overflow:auto;">
<table>
<thead>
<tr>
<th>date</th>
<th>day</th>
<th>temp_max</th>
<th>temp_avg</th>
<th>temp_min</th>
<th>dew_max</th>
<th>dew_avg</th>
<th>dew_min</th>
<th>hum_max</th>
<th>hum_avg</th>
<th>hum_min</th>
<th>wind_max</th>
<th>wind_avg</th>
<th>wind_min</th>
<th>press_max</th>
<th>press_avg</th>
<th>press_min</th>
<th>precipit</th>
<th>day_of_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06</td>
<td>1</td>
<td>85</td>
<td>75</td>
<td>68</td>
<td>74</td>
<td>71</td>
<td>66</td>
<td>100</td>
<td>89</td>
<td>65</td>
<td>21</td>
<td>10</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>2016-02-06</td>
<td>2</td>
<td>76</td>
<td>71</td>
<td>66</td>
<td>74</td>
<td>70</td>
<td>66</td>
<td>100</td>
<td>97</td>
<td>89</td>
<td>18</td>
<td>8</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>2016-02-07</td>
<td>2</td>
<td>95</td>
<td>86</td>
<td>76</td>
<td>76</td>
<td>73</td>
<td>69</td>
<td>94</td>
<td>67</td>
<td>43</td>
<td>12</td>
<td>6</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>6</td>
</tr>
<tr>
<td>2016-02-08</td>
<td>2</td>
<td>97</td>
<td>87</td>
<td>77</td>
<td>77</td>
<td>74</td>
<td>71</td>
<td>94</td>
<td>66</td>
<td>43</td>
<td>15</td>
<td>5</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>2016-02-09</td>
<td>2</td>
<td>95</td>
<td>85</td>
<td>77</td>
<td>75</td>
<td>74</td>
<td>70</td>
<td>90</td>
<td>70</td>
<td>51</td>
<td>16</td>
<td>7</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>2016-02-10</td>
<td>2</td>
<td>86</td>
<td>74</td>
<td>65</td>
<td>64</td>
<td>61</td>
<td>58</td>
<td>90</td>
<td>66</td>
<td>40</td>
<td>8</td>
<td>4</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>2016-03-06</td>
<td>3</td>
<td>79</td>
<td>72</td>
<td>68</td>
<td>72</td>
<td>70</td>
<td>68</td>
<td>100</td>
<td>94</td>
<td>72</td>
<td>18</td>
<td>5</td>
<td>0</td>
<td>30</td>
<td>30</td>
<td>30</td>
<td>3</td>
<td>6</td>
</tr>
</tbody>
</table>
</div><!--kg-card-end: markdown--><p>Python code:</p><p>In this first Python code snippet, I show the modules I needed to import and the connection code that I would have to run to access the data from my database and import it into a pandas DataFrame. </p><p>One of the challenges I faced while data munging in Python was the need to run through the entire script again and again when evaluating, cleaning, and transforming the data. This initial data pulling process usually takes a good bit of time, so it was often frustrating to run through it repetitively. I also would have to call <code>print()</code> anytime I wanted to quickly glance at an array, DataFrame, or element. These kinds of extra tasks in Python can be time-consuming, especially if you end up at the modeling stage of the analysis lifecycle with only a subset of the original data! All this to say: keep in mind that the other code snippets in this blog omit this setup code; however, it still runs in the background. </p><p>Additionally, because I have my data housed in a TimescaleDB instance, I still need to use the <code>SELECT</code> statement to query the data from the database and read it into Python. If you use a relational database - which I explained is very beneficial to analysis in my previous post - you will have to use <em>some</em> SQL.</p><!--kg-card-begin: markdown--><pre><code class="language-python">import psycopg2
import pandas as pd
import configparser
import numpy as np
import tempfile
import matplotlib.pyplot as plt
 
## use config file for database connection information
config = configparser.ConfigParser()
config.read('env.ini')
 
## establish connection
conn = psycopg2.connect(database=config.get('USERINFO', 'DB_NAME'),
                       host=config.get('USERINFO', 'HOST'),
                       user=config.get('USERINFO', 'USER'),
                       password=config.get('USERINFO', 'PASS'),
                       port=config.get('USERINFO', 'PORT'))
 

## define the queries for copying data out of our database (using format to copy queries)                    
query_weather = &quot;select * from weather&quot;
query_power = &quot;select * from power_usage&quot;
## define function to copy the data to a csv
def copy_from_db(query, cur):
    with tempfile.TemporaryFile() as tmpfile:
        copy_sql = &quot;COPY ({query}) TO STDOUT WITH CSV {head}&quot;.format(
            query=query, head=&quot;HEADER&quot;
            )
        cur.copy_expert(copy_sql, tmpfile)
        tmpfile.seek(0)
        df = pd.read_csv(tmpfile)
        return df
## create cursor to use in function above and place data into a file
cursor = conn.cursor()
weather_df = copy_from_db(query_weather, cursor)
power_df = copy_from_db(query_power, cursor)
cursor.close()
conn.close()


print(weather_df.head(20))
print(power_df.head(20))
</code></pre>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique2"></a></p>
<!--kg-card-end: markdown--><h3 id="finding-and-observing-%E2%80%9Ccategorical%E2%80%9D-column-values-in-my-dataset">Finding and observing “categorical” column values in my dataset</h3><p>Next, I think it is essential to understand any “categorical” columns - columns with a finite set of values - that I might have. This is useful in analysis because categorical data can give insight into natural groupings that often occur within a dataset. For example, I would assume that energy usage for many people is different on a weekday vs. a weekend. We can’t verify this without knowing the categorical possibilities and seeing how each could impact the data trend. </p><p>First, I want to look at my tables and the data types used for each column. Looking at the available columns in each table, I can make an educated guess that the <code>day_of_week</code>, <code>notes</code>, and <code>day</code> columns will be categorical. Let’s find out if they indeed are and how many different values exist in each. </p><p>To find all the distinct values within a column (or between multiple columns), you can use the <code>DISTINCT</code> keyword after <code>SELECT</code> in your query statement. This can be useful for several data munging tasks, such as identifying categories - which I need to do - or finding unique sets of data. </p><p>Since I want to look at the unique values within each column individually, I will run a query for each separately. If I were to run a query like this 👇</p><!--kg-card-begin: markdown--><pre><code class="language-sql">SELECT DISTINCT day_of_week, notes 
FROM power_usage pu;
</code></pre>
<!--kg-card-end: markdown--><p>I would get data like this</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>vacation</td>
</tr>
<tr>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>1</td>
<td>vacation</td>
</tr>
<tr>
<td>2</td>
<td>vacation</td>
</tr>
<tr>
<td>4</td>
<td>vacation</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>The output would show the unique <em>pairs</em> of <code>day_of_week</code> and <code>notes</code> values within the table. This is why I need to query a single column in each statement, so that I only see that individual column’s unique values and not the unique sets of values. </p><p>For these queries, I am also going to include the <code>ORDER BY</code> command to show the values of each column in ascending order. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="orderby"></a></p>
<pre><code class="language-sql">-- selecting distinct values in the ‘day_of_week’ column within my power_usage table
SELECT DISTINCT day_of_week 
FROM power_usage pu 
ORDER BY day_of_week ASC;
-- selecting distinct values in the ‘notes’ column within my power_usage table
SELECT DISTINCT notes 
FROM power_usage pu 
ORDER BY notes ASC;

-- selecting distinct values in the ‘day’ column within my weather table
SELECT DISTINCT &quot;day&quot; 
FROM weather w 
ORDER BY &quot;day&quot; ASC;
-- selecting distinct values in the ‘day_of_week’ column within my weather table
SELECT DISTINCT day_of_week 
FROM weather w 
ORDER BY day_of_week ASC;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><p>Notice that the person recording this data included a COVID-related category (<code>COVID_lockdown</code>) in their <code>notes</code> column. As mentioned above, this note could be essential for finding and understanding patterns in this family's energy usage.</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day_of_week</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
</tr>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
<tr>
<td>5</td>
</tr>
<tr>
<td>6</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>COVID_lockdown</td>
</tr>
<tr>
<td>vacation</td>
</tr>
<tr>
<td>weekday</td>
</tr>
<tr>
<td>weekend</td>
</tr>
</tbody>
</table>
<p>(Only some of the values shown for day)</p>
<table>
<thead>
<tr>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
<tr>
<td>4</td>
</tr>
<tr>
<td>5</td>
</tr>
<tr>
<td>6</td>
</tr>
<tr>
<td>7</td>
</tr>
<tr>
<td>8</td>
</tr>
<tr>
<td>9</td>
</tr>
<tr>
<td>10</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>In my Python code, notice that I need to print anything that I want to quickly observe. I have found this to be the quickest solution, even when compared to using the Python console in debug mode.</p><!--kg-card-begin: markdown--><pre><code class="language-python">p_day_of_the_week = power_df['day_of_week'].unique()
p_notes = power_df['notes'].unique()
w_day = weather_df['day'].unique()
w_day_of_the_week = weather_df['day_of_week'].unique()
print(sorted(p_day_of_the_week), sorted(p_notes), sorted(w_day), sorted(w_day_of_the_week))
</code></pre>
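To see the difference between distinct <em>pairs</em> and distinct single-column values in a runnable form, here is a sketch using Python's built-in <code>sqlite3</code> module with a few made-up <code>power_usage</code> rows; <code>DISTINCT</code> behaves the same way in PostgreSQL.

```python
import sqlite3

# Toy power_usage rows (hypothetical data): day_of_week/notes combinations repeat.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE power_usage (day_of_week INTEGER, notes TEXT)")
conn.executemany(
    "INSERT INTO power_usage VALUES (?, ?)",
    [(1, "weekday"), (1, "weekday"), (1, "vacation"), (6, "weekend")],
)

# DISTINCT over two columns returns unique *pairs* ...
pairs = conn.execute(
    "SELECT DISTINCT day_of_week, notes FROM power_usage "
    "ORDER BY day_of_week ASC, notes ASC"
).fetchall()

# ... while DISTINCT over one column returns that column's unique values.
days = conn.execute(
    "SELECT DISTINCT day_of_week FROM power_usage ORDER BY day_of_week ASC"
).fetchall()

print(pairs)  # [(1, 'vacation'), (1, 'weekday'), (6, 'weekend')]
print(days)   # [(1,), (6,)]
```

The pair query returns three rows for only two distinct days, which is exactly why each categorical column should be checked in its own query.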
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique3"></a></p>
<!--kg-card-end: markdown--><h3 id="sorting-my-data-by-specific-columns">Sorting my data by specific columns</h3><p>What if I want to evaluate my tables based on how specific columns were sorted? One of the top questions asked on StackOverflow for Python data analysis is "<a href="https://stackoverflow.com/questions/17141558/how-to-sort-a-dataframe-in-python-pandas-by-two-or-more-columns">How to sort a dataframe in python pandas by two or more columns?</a>". Once again, we can do this intuitively through SQL.</p><p>One of the things I'm interested in identifying is how bad weather impacts energy usage. To do this, I have to think about indicators that typically signal bad weather, which include high precipitation, high wind speed, and low pressure. To identify days with this pattern in my PostgreSQL <code>weather</code> table, I need to use the <code>ORDER BY</code> keyword, then call out each column in the order I want things sorted, specifying the <code>DESC</code> and <code>ASC</code> attributes as needed. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- sort weather data by precipitation desc first, wind_avg desc second, and pressure asc third
SELECT &quot;date&quot;, precipit, wind_avg, press_avg 
FROM weather w 
ORDER BY precipit DESC, wind_avg DESC, press_avg ASC;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>date</th>
<th>precipit</th>
<th>wind_avg</th>
<th>press_avg</th>
</tr>
</thead>
<tbody>
<tr>
<td>2017-08-27</td>
<td>13</td>
<td>15</td>
<td>30</td>
</tr>
<tr>
<td>2017-08-28</td>
<td>11</td>
<td>24</td>
<td>30</td>
</tr>
<tr>
<td>2019-09-20</td>
<td>9</td>
<td>9</td>
<td>30</td>
</tr>
<tr>
<td>2017-08-08</td>
<td>6</td>
<td>5</td>
<td>30</td>
</tr>
<tr>
<td>2017-08-29</td>
<td>5</td>
<td>22</td>
<td>30</td>
</tr>
<tr>
<td>2018-08-12</td>
<td>5</td>
<td>12</td>
<td>30</td>
</tr>
<tr>
<td>2016-02-06</td>
<td>4</td>
<td>8</td>
<td>30</td>
</tr>
<tr>
<td>2018-05-07</td>
<td>4</td>
<td>7</td>
<td>30</td>
</tr>
<tr>
<td>2019-10-05</td>
<td>3</td>
<td>9</td>
<td>30</td>
</tr>
<tr>
<td>2018-03-29</td>
<td>3</td>
<td>8</td>
<td>30</td>
</tr>
<tr>
<td>2016-03-06</td>
<td>3</td>
<td>5</td>
<td>30</td>
</tr>
<tr>
<td>2018-06-19</td>
<td>2</td>
<td>12</td>
<td>30</td>
</tr>
<tr>
<td>2019-08-05</td>
<td>2</td>
<td>11</td>
<td>30</td>
</tr>
<tr>
<td>2019-10-30</td>
<td>2</td>
<td>11</td>
<td>30</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>I have often found the different pandas or Python functions to be harder to know off the top of my head. With how popular the StackOverflow question is, I can imagine that many of you also had to refer to Google for how to do this initially.</p><!--kg-card-begin: markdown--><pre><code class="language-python">sorted_weather = weather_df[['date', 'precipit', 'wind_avg', 'press_avg']].sort_values(['precipit', 'wind_avg', 'press_avg'], ascending=[False, False, True])
print(sorted_weather)
</code></pre>
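The mixed <code>DESC</code>/<code>ASC</code> sort is easy to get backwards, so here is a minimal runnable check of the ordering logic, again using <code>sqlite3</code> with three invented weather rows (the third row ties on precipitation and wind so that only the ascending pressure column breaks the tie).

```python
import sqlite3

# Hypothetical weather rows; SQLite stands in for PostgreSQL for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (date TEXT, precipit REAL, wind_avg REAL, press_avg REAL)"
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?)",
    [
        ("2017-08-27", 13, 15, 30),
        ("2017-08-28", 11, 24, 30),
        ("2018-05-07", 11, 24, 29),  # ties on precipit and wind_avg; lower pressure sorts first
    ],
)

# Highest precipitation first, then highest wind, then *lowest* pressure.
rows = conn.execute(
    "SELECT date FROM weather ORDER BY precipit DESC, wind_avg DESC, press_avg ASC"
).fetchall()
print([r[0] for r in rows])  # ['2017-08-27', '2018-05-07', '2018-08-28'? no: '2017-08-28' last]
```

The tied rows demonstrate that later sort keys only matter when the earlier keys are equal.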
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique4"></a></p>
<!--kg-card-end: markdown--><h3 id="displaying-grouped-data">Displaying grouped data</h3><p>Finding the sum of energy usage from data that records energy per hour can be instrumental in understanding data patterns. This concept boils down to performing a type of aggregation over a particular column. Between PostgreSQL and TimescaleDB, we have access to almost every type of aggregation function we could need. I will show some of these operators in this blog series, but I strongly encourage all of you to <a href="https://www.postgresql.org/docs/current/functions-aggregate.html">lookup more</a> for your own use!</p><p>From the categorical section earlier, I mentioned that I suspect people could have different energy behavior patterns on weekdays vs. weekends, particularly in a single-family home in the US. Given my data set, I’m curious about this hypothesis and want to find the cumulative energy consumption across each day of the week. </p><p>To do so, I need to sum all the kWh data (<code>value_kwh</code>) in the power table, then group this data by the day of the week (<code>day_of_week</code>). In order to sum my data in PostgreSQL, I will use the <code><a href="https://www.postgresql.org/docs/current/functions-aggregate.html">SUM()</a></code> function. Because this is an aggregation function, I will have to include something that tells the database what to sum over. Since I want to know the sum of energy over each type of day, I can specify that the sum should be grouped by the <code>day_of_week</code> column using the <code>GROUP BY</code> keyword. I also added the <code>ORDER BY</code> keyword so that we could look at the weekly summed usage in order of the day. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="groupby"></a></p>
<pre><code class="language-sql">-- first I select the day_of_week col, then I use SUM(value_kwh) to get the sum of the value_kwh col
SELECT day_of_week, SUM(value_kwh) -- sum the value_kwh column
FROM power_usage pu 
GROUP BY day_of_week -- group by the day_of_week col
ORDER BY day_of_week ASC; -- decided to order data by the day_of_week asc
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><p>After some quick investigation, I found that the value <code>0</code> in the <code>day_of_week</code> column represents Monday; since the weekend values (<code>5</code> and <code>6</code>) show the highest totals, my hypothesis may just be right. </p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day_of_week</th>
<th>sum</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3849</td>
</tr>
<tr>
<td>1</td>
<td>3959</td>
</tr>
<tr>
<td>2</td>
<td>3947</td>
</tr>
<tr>
<td>3</td>
<td>4094</td>
</tr>
<tr>
<td>4</td>
<td>3987</td>
</tr>
<tr>
<td>5</td>
<td>4169</td>
</tr>
<tr>
<td>6</td>
<td>4311</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>Something to note about the pandas <code>groupby()</code> function is that the group by column in the DataFrame will become the index column in the resulting aggregated DataFrame. This can add some extra work later on.</p><!--kg-card-begin: markdown--><pre><code class="language-python">day_agg_power = power_df.groupby('day_of_week').agg({'value_kwh' : 'sum'})
print(day_agg_power)
</code></pre>
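As a runnable sanity check of how <code>SUM()</code> plus <code>GROUP BY</code> collapses many rows into one row per group, here is a sketch with four made-up hourly readings in <code>sqlite3</code>; the aggregation semantics match PostgreSQL.

```python
import sqlite3

# Hypothetical hourly readings: two rows each for day_of_week 0 and 6.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE power_usage (day_of_week INTEGER, value_kwh REAL)")
conn.executemany(
    "INSERT INTO power_usage VALUES (?, ?)",
    [(0, 1.0), (0, 2.0), (6, 4.0), (6, 0.5)],
)

# One output row per day_of_week, with the kWh values summed within each group.
totals = conn.execute(
    "SELECT day_of_week, SUM(value_kwh) FROM power_usage "
    "GROUP BY day_of_week ORDER BY day_of_week ASC"
).fetchall()
print(totals)  # [(0, 3.0), (6, 4.5)]
```

Every column in the <code>SELECT</code> list is either grouped on or aggregated; that rule is what tells the database how to collapse the rows.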
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p><a name="technique5"></a></p>
<!--kg-card-end: markdown--><h3 id="finding-abnormalities-in-the-database">Finding abnormalities in the database</h3><p>Clean data is fundamental in producing accurate analysis, and abnormalities/errors can be a huge roadblock to clean data. An essential part of evaluating data is finding abnormalities to determine if an error caused them. No data set is perfect, so it is vital to hunt down any possible errors in preparation for the cleaning stage of our analysis. Let's look at one example of how to uncover issues in a dataset using our example energy data.</p><p>After looking at the raw data in my <code>power_usage</code> table, I found that the <code>notes</code> and <code>day_of_week</code> columns <strong>should be the same for each hour across a single day </strong>(there are 24 hourly readings each day, and each hour is supposed to have the same <code>notes</code> value). In my experience with data analysis, I have found that notes which need to be recorded granularly often have mistakes within them. Because of this, I wanted to investigate whether or not this pattern was consistent across all of the data.</p><p>To check this hypothesis I can use the TimescaleDB <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/#time-bucket"><code>time_bucket()</code></a> function, PostgreSQL’s <code>GROUP BY</code> keyword, and <a href="https://www.postgresql.org/docs/13/queries-with.html">CTEs</a> (common table expressions). While the <code>GROUP BY</code> keyword is likely familiar to you by now, CTEs and the <code>time_bucket()</code> function are not. So, before I show the query, let’s dive into these two features.</p><p><strong>Time bucket function</strong></p><p>The <code>time_bucket()</code> function allows you to take a timestamp column like <code>startdate</code> in the <code>power_usage</code> table, and “bucket” the time based on the interval of your choice. 
For example, <code>startdate</code> is a timestamp column that shows values for each hour in a day. You could use the <code>time_bucket()</code> function on this column to “bucket” the hourly data into daily data. </p><p>Here is an image that shows how rows of the <code>startdate</code> column are bucketed into one aggregate row with <code>time_bucket(‘1 day’, startdate)</code>.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2021/09/image.png" class="kg-image" alt="Image showing how hourly data from 2016-01-01 00:00:00 - 2016-01-01 23:00:00 is bucketed to 2016-01-01 00:00:00 using the time_bucket() function " loading="lazy" width="1498" height="846" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/image.png 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/image.png 1000w, https://timescale.ghost.io/blog/content/images/2021/09/image.png 1498w" sizes="(min-width: 720px) 720px"></figure><p>After using the <code>time_bucket()</code> function in my query, I will have one unique “date” value for any data recorded over a single day. Since <code>notes</code> and <code>day_of_week</code> should also be unique over each day, if I <em>group by</em> these columns, I should get a single set of (date, day_of_week, notes) values. </p><p>Notice that to use <code>GROUP BY</code> in this scenario, I just list the columns I want to group on. Also, notice that I added <code>AS</code> after my <code>time_bucket()</code> function; this keyword allows you to "rename" columns. In the results, look for the <code>day</code> column, as this comes directly from my rename. </p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="timebucket"></a></p>
<pre><code class="language-sql">-- select the date through time_bucket and get unique values for each 
-- (date, day_of_week, notes) set
SELECT 
	time_bucket(interval '1 day', startdate ) AS day,
	day_of_week,
	notes
FROM power_usage pu 
GROUP BY day, day_of_week, notes;
</code></pre>
<!--kg-card-end: markdown--><p>Results: Some of the rows</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2017-01-19 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-10-06 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-06-04 00:00:00</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2019-01-03 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-10-01 00:00:00</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2019-11-27 00:00:00</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-06-15 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-11-16 00:00:00</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2017-05-18 00:00:00</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2018-07-17 00:00:00</td>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>2020-03-06 00:00:00</td>
<td>4</td>
<td>weekday</td>
</tr>
<tr>
<td>2018-10-14 00:00:00</td>
<td>6</td>
<td>weekend</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>In my Python code, I cannot just manipulate the table to print results; I actually have to create another column in the DataFrame.</p><!--kg-card-begin: markdown--><pre><code class="language-python">day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
power_df.insert(0, 'date_day', day_col)
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
print(power_unique)
</code></pre>
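Since <code>time_bucket()</code> is TimescaleDB-specific, here is a runnable sketch of the same day-sized bucketing idea using <code>sqlite3</code>, where truncating the timestamp with <code>date()</code> plays the role of the one-day bucket (in plain PostgreSQL, <code>date_trunc('day', startdate)</code> would be the analogous stand-in).

```python
import sqlite3

# Hypothetical hourly rows; two readings fall on the same day with the same
# (day_of_week, notes) pair, so they should collapse to one bucketed row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE power_usage (startdate TEXT, day_of_week INTEGER, notes TEXT)"
)
conn.executemany(
    "INSERT INTO power_usage VALUES (?, ?, ?)",
    [
        ("2016-01-06 01:00:00", 2, "weekday"),
        ("2016-01-06 02:00:00", 2, "weekday"),  # same day and pair -> one output row
        ("2016-01-07 01:00:00", 3, "weekday"),
    ],
)

# date(startdate) truncates '2016-01-06 01:00:00' to '2016-01-06', emulating a
# one-day bucket; grouping then yields one (day, day_of_week, notes) set per day.
buckets = conn.execute(
    "SELECT date(startdate) AS day, day_of_week, notes "
    "FROM power_usage GROUP BY day, day_of_week, notes ORDER BY day ASC"
).fetchall()
print(buckets)  # [('2016-01-06', 2, 'weekday'), ('2016-01-07', 3, 'weekday')]
```
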
<!--kg-card-end: markdown--><p>Now that we understand the <code>time_bucket()</code> function a little better, let's look at CTEs and how they help me use this bucketed data to find any errors within the <code>notes</code> column. </p><p><strong>CTEs or common table expressions</strong></p><p>Getting unique sets of data only solves half of my problem. Now I want to verify that each day is truly mapped to a single <code>day_of_week</code> and <code>notes</code> pair. This is where CTEs come in handy. With CTEs, you can build a query based on the results of others. </p><p>CTEs use the following format 👇</p><!--kg-card-begin: markdown--><pre><code class="language-sql">WITH query_1 AS (
SELECT -- columns expressions
FROM table_name
)
SELECT --column expressions 
FROM query_1;
</code></pre>
<!--kg-card-end: markdown--><p><code>WITH</code> and <code>AS</code> allow you to define the first query; then, in the second <code>SELECT</code> statement, you can query the results of the first as if it were another table in the database. </p><p>To check that each day was “mapped” to a single <code>day_of_week</code> and <code>notes</code> pair, I need to aggregate the queried <code>time_bucket()</code> table above based upon the date column using another PostgreSQL aggregation function, <code>COUNT()</code>. I am doing this because each day <em>should</em> count only one unique <code>day_of_week</code> and <code>notes</code> pair. If the count results in two or more, this implies that one day contains multiple <code>day_of_week</code> and <code>notes</code> pairs and thus is showing abnormal data. </p><p>Additionally, I will add a <code>HAVING</code> statement into my query so that the output only displays rows where the <code>COUNT(day)</code> is greater than one. I will also throw in an <code>ORDER BY</code> statement in case we have many different values greater than 1.</p><p>PostgreSQL code:</p><!--kg-card-begin: markdown--><p><a name="cte"></a></p>
<pre><code class="language-sql">WITH power_unique AS (
-- query from above, get unique set of (date, day_of_week, notes)
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	day_of_week,
	notes
FROM power_usage pu 
GROUP BY day, day_of_week, notes
)
-- calls data from the query above, using the COUNT() agg function
SELECT day, COUNT(day) 
FROM power_unique
GROUP BY day
HAVING COUNT(day) &gt; 1
ORDER BY COUNT(day) DESC;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><table>
<thead>
<tr>
<th>day</th>
<th>count</th>
</tr>
</thead>
<tbody>
<tr>
<td>2017-12-27 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2020-01-03 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2018-06-02 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2019-06-03 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2020-07-01 00:00:00</td>
<td>2</td>
</tr>
<tr>
<td>2016-07-21 00:00:00</td>
<td>2</td>
</tr>
</tbody>
</table>
<!--kg-card-end: markdown--><p>Python code:</p><p>Because of the count aggregation, I needed to rename the column in my <code>agg_power_unique</code> DataFrame so that I could then sort the values.</p><!--kg-card-begin: markdown--><pre><code class="language-python">day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
## insert the date_day column once; re-inserting an existing column raises ValueError
if 'date_day' not in power_df.columns:
    power_df.insert(0, 'date_day', day_col)
power_unique = power_df[['date_day', 'day_of_week', 'notes']].drop_duplicates()
agg_power_unique = power_unique.groupby('date_day').agg({'date_day' : 'count'})
agg_power_unique = agg_power_unique.rename(columns={'date_day': 'count'})
print(agg_power_unique.loc[agg_power_unique['count'] &gt; 1].sort_values('count', ascending=False))
</code></pre>
<!--kg-card-end: markdown--><p>This query reveals that I indeed have a couple of data points that seem suspicious. Specifically, the dates [2017-12-27, 2020-01-03, 2018-06-02, 2019-06-03, 2020-07-01, 2016-07-21]. I will demonstrate how to fix these date issues in a later blog post about cleaning techniques. </p><p>This example only shows one set of functions that helped me identify abnormal data through grouping and aggregation. You can use many other PostgreSQL and TimescaleDB functions to find other abnormalities in your data, such as TimescaleDB’s <code>approx_percentile()</code> function (introduced next) to find outliers in numeric columns through interquartile-range calculations.</p><!--kg-card-begin: markdown--><p><a name="technique6"></a></p>
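As a sketch of that interquartile-range idea, here is roughly what 1.5 × IQR outlier fencing looks like in Python. The values below are hypothetical, not taken from the `power_usage` table:

```python
# Hedged sketch: flag values outside 1.5 * IQR of the 25th/75th percentiles.
# The daily kWh sums below are made up for illustration.
import numpy as np

values = np.array([4.0, 7.0, 16.0, 29.0, 39.0, 57.0, 250.0])
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # only the 250.0 reading falls outside the fences
```

The same fencing can be expressed in SQL with the percentile functions introduced below; the point is only that the quartile arithmetic is identical either way.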
<!--kg-card-end: markdown--><h3 id="looking-at-general-trends">Looking at general trends</h3><p>Arguably, one of the more critical aspects of evaluating your data is understanding the general trends. To do this, you need to get basic statistics on your data using functions like mean, interquartile range, maximum values, and others. TimescaleDB provides many optimized hyperfunctions for these very tasks.</p><p>To calculate these values, I am going to introduce the following TimescaleDB functions: `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/approx_percentile/">approx_percentile</a>`, `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/min_val/">min_val</a>`, `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/max_val/">max_val</a>`, `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/mean/">mean</a>`, `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/num_vals-pct/">num_vals</a>`, `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile_agg/">percentile_agg</a>` (aggregate), and `<a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/percentile-aggregation-methods/tdigest/">tdigest</a>` (aggregate).</p><p>These hyperfunctions fall under the TimescaleDB category of two-step aggregation. Timescale designed each function to be either an aggregate or an accessor function (I noted which ones above were aggregate functions). 
In two-step aggregation, the computationally taxing aggregate function is calculated first, and the accessor function is then applied to its result.</p><figure class="kg-card kg-image-card"><img src="https://timescale.ghost.io/blog/content/images/2022/01/Untitled--1--1.png" class="kg-image" alt="accessor_function(aggregate_function())" loading="lazy"></figure><p>For specifics on how two-step aggregation works and why we use this convention, check out <a href="https://timescale.ghost.io/blog/blog/how-postgresql-aggregation-works-and-how-it-inspired-our-hyperfunctions-design-2/">David Kohn’s blog series on our hyperfunctions and two-step aggregation</a>. </p><p>I definitely want to understand the basic trends within the <code>power_usage</code> table for my data set. If I plan to do any type of modeling to predict future usage trends, I need to know some basic information about what this home’s usage looks like daily. </p><p>To understand the daily power usage distribution, I’ll need to aggregate the energy usage per day. To do this, I can use the <code>time_bucket()</code> function I mentioned above, along with the <code>SUM()</code> aggregate function. </p><!--kg-card-begin: markdown--><pre><code class="language-sql">-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	SUM(value_kwh)
FROM power_usage pu 
GROUP BY day;
</code></pre>
<!--kg-card-end: markdown--><p>I then want to find the 1st, 10th, 25th, 75th, 90th, and 99th percentiles, the median (50th percentile), mean, minimum value, maximum value, number of readings in the table, and interquartile range of this data. Creating the query with a CTE simplifies the process by calculating the daily sums only once and reusing them multiple times.</p><p>PostgreSQL:</p><!--kg-card-begin: markdown--><p><a name="stats"></a></p>
<pre><code class="language-sql">WITH power_usage_sum AS (
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	SUM(value_kwh) AS sum_kwh
FROM power_usage pu 
GROUP BY day
)
-- using two-step aggregation functions to find stats
SELECT approx_percentile(0.01,percentile_agg(sum_kwh)) AS &quot;1p&quot;,
approx_percentile(0.10,percentile_agg(sum_kwh)) AS &quot;10p&quot;,
approx_percentile(0.25,percentile_agg(sum_kwh)) AS &quot;25p&quot;,
approx_percentile(0.5,percentile_agg(sum_kwh)) AS &quot;50p&quot;,
approx_percentile(0.75,percentile_agg(sum_kwh)) AS &quot;75p&quot;,
approx_percentile(0.90,percentile_agg(sum_kwh)) AS &quot;90p&quot;,
approx_percentile(0.99,percentile_agg(sum_kwh)) AS &quot;99p&quot;,
min_val(tdigest(100, sum_kwh)),
max_val(tdigest(100, sum_kwh)),
mean(percentile_agg(sum_kwh)),
num_vals(percentile_agg(sum_kwh)),
-- you can use subtraction to create an output for the IQR
approx_percentile(0.75,percentile_agg(sum_kwh)) - approx_percentile(0.25,percentile_agg(sum_kwh)) AS iqr
FROM power_usage_sum pus;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><div style="width:100%;overflow:auto;">
<table>
<thead>
<tr>
<th>1p</th>
<th>10p</th>
<th>25p</th>
<th>50p</th>
<th>75p</th>
<th>90p</th>
<th>99p</th>
<th>min_val</th>
<th>max_val</th>
<th>mean</th>
<th>num_vals</th>
<th>iqr</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.0</td>
<td>4.0028</td>
<td>6.9936</td>
<td>16.0066</td>
<td>28.9914</td>
<td>38.9781</td>
<td>56.9971</td>
<td>0.0</td>
<td>73.0</td>
<td>18.9025</td>
<td>1498.0</td>
<td>21.9978</td>
</tr>
</tbody>
</table>
</div><!--kg-card-end: markdown--><p>Python:</p><p>Something that really stumped me when initially writing this code snippet was that I had to use <code>astype(float)</code> on my <code>value_kwh</code> column before I could use <code>describe()</code>. Over my life, I have probably spent a full day's worth of time dealing with value types that are incompatible with certain functions. This is another reason why I enjoy data munging with the intuitive functionality of PostgreSQL and TimescaleDB; these types of problems just happen less often. And let me tell you, the faster and less painful data munging is, the happier I am!</p><!--kg-card-begin: markdown--><pre><code class="language-python">agg_power = power_df.groupby('date_day').agg({'value_kwh' : 'sum'})
# need to make the value_kwh column the right data type
agg_power.value_kwh = agg_power.value_kwh.astype(float)
describe = agg_power.value_kwh.describe()
percentiles = agg_power.value_kwh.quantile([.01, .1, .9, .99])
q75, q25 = np.percentile(agg_power['value_kwh'], [75 ,25])
iqr = q75 - q25
print(describe, percentiles, iqr)
</code></pre>
<!--kg-card-end: markdown--><p>Another technique you may want to use for assessing the distribution of data in a column is a histogram. Generally, creating an image is where Python and other tools shine. However, I often need to glance at a histogram to check for any blatant anomalies when evaluating data. While this one technique in TimescaleDB may not be as simple as the Python solution, I can still do this directly in my database, which can be convenient. </p><p>To create a histogram in the database, we will need to use the TimescaleDB <a href="https://docs.timescale.com/api/latest/hyperfunctions/histogram/#required-arguments"><code>histogram()</code></a> function, <a href="https://www.postgresql.org/docs/13/functions-array.html"><code>unnest()</code></a>, <a href="https://www.postgresql.org/docs/13/functions-srf.html"><code>generate_series()</code></a>, <a href="https://www.postgresql.org/docs/13/functions-string.html"><code>repeat()</code></a>, and CTEs.  <br></p><p>The <code>histogram()</code> function takes in the column you want to analyze and produces an array containing the frequency values for the number of buckets plus two (one extra bucket for values below the lowest boundary and one for values above the highest). You can then use PostgreSQL’s <code>unnest()</code> function to break up the array into a single column with rows equal to two plus the number of specified buckets. </p><p>Once you have a column with bucket frequencies, you can then create a histogram “image” using the PostgreSQL <code>repeat()</code> function. The first time I saw someone use the <code>repeat()</code> function in this way was in <a href="https://hakibenita.com/sql-for-data-analysis">Haki Benita’s blog post</a>, which I recommend reading if you are interested in learning more PostgreSQL analytical techniques. The <code>repeat()</code> function essentially creates a string that repeats a chosen character a specified number of times. 
To display the histogram frequencies, you simply pass the unnested histogram counts as the repeat-count argument. </p><p>Additionally, I find it useful to know the approximate starting value for each bucket in the histogram. This gives me a better picture of which values fall into each bucket. To approximate the bin values, I use the PostgreSQL <code>generate_series()</code> function along with some algebra:</p><!--kg-card-begin: markdown--><pre><code class="language-sql">(generate_series(-1, [number_of_buckets]) * [max_val - min_val]::float/[number_of_buckets]::float) + [min_val]
</code></pre>
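As a sanity check on that formula, the same arithmetic is easy to reproduce outside the database. Here is a sketch in plain Python using the min = 0, max = 73, 30-bucket values from the histogram example in this post:

```python
# Reproduce the approximate bucket start values:
# (series_value * (max_val - min_val) / number_of_buckets) + min_val
min_val, max_val, buckets = 0, 73, 30
starts = [round(i * (max_val - min_val) / buckets + min_val, 2)
          for i in range(-1, buckets + 1)]
print(starts[:4])  # [-2.43, 0.0, 2.43, 4.87], matching the results table
```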
<!--kg-card-end: markdown--><p>When I put all these techniques together, I can produce a histogram with the following query.</p><p>PostgreSQL:</p><!--kg-card-begin: markdown--><p><a name="hist"></a></p>
<pre><code class="language-sql">WITH power_usage_sum AS (
-- bucket the daily data using time_bucket, sum kWh over each bucketed day
SELECT 
	time_bucket(INTERVAL '1 day', startdate ) AS day,
	SUM(value_kwh) AS sum_kwh
FROM power_usage pu 
GROUP BY day
),
histogram AS (
-- I input the column = sum_kwh, the min value = 0, max value = 73, and number of buckets = 30
SELECT histogram(sum_kwh, 0, 73, 30)
FROM power_usage_sum w 
)
SELECT 
-- I use unnest to create the first column
   unnest(histogram) AS count, 
-- I use my approximate bucket values function
   (generate_series(-1, 30) * 73::float/30::float) + 0 AS approx_bucket_start_val,
-- I then use the repeat function to display the frequency
   repeat('■', unnest(histogram)) AS frequency
FROM histogram;
</code></pre>
<!--kg-card-end: markdown--><p>Results:</p><!--kg-card-begin: markdown--><div style="width:100%;overflow:auto;">
<table>
<thead>
<tr>
<th>count</th>
<th>approx_bucket_start_val</th>
<th>frequency</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>-2.43</td>
<td></td>
</tr>
<tr>
<td>83</td>
<td>0.0</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>104</td>
<td>2.43</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>207</td>
<td>4.87</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>105</td>
<td>7.3</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>150</td>
<td>9.73</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>76</td>
<td>12.17</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>105</td>
<td>14.6</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>62</td>
<td>17.03</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>48</td>
<td>19.47</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>77</td>
<td>21.9</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>35</td>
<td>24.33</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>83</td>
<td>26.77</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>42</td>
<td>29.2</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>72</td>
<td>31.63</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>46</td>
<td>34.07</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>51</td>
<td>36.5</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>39</td>
<td>38.93</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>32</td>
<td>41.37</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>24</td>
<td>43.8</td>
<td>■■■■■■■■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>16</td>
<td>46.23</td>
<td>■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>17</td>
<td>48.67</td>
<td>■■■■■■■■■■■■■■■■■</td>
</tr>
<tr>
<td>4</td>
<td>51.1</td>
<td>■■■■</td>
</tr>
<tr>
<td>3</td>
<td>53.53</td>
<td>■■■</td>
</tr>
<tr>
<td>5</td>
<td>55.97</td>
<td>■■■■■</td>
</tr>
<tr>
<td>4</td>
<td>58.4</td>
<td>■■■■</td>
</tr>
<tr>
<td>5</td>
<td>60.83</td>
<td>■■■■■</td>
</tr>
<tr>
<td>1</td>
<td>63.27</td>
<td>■</td>
</tr>
<tr>
<td>0</td>
<td>65.7</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>68.13</td>
<td>■</td>
</tr>
<tr>
<td>0</td>
<td>70.57</td>
<td></td>
</tr>
<tr>
<td>1</td>
<td>73.0</td>
<td>■</td>
</tr>
</tbody>
</table>
</div><!--kg-card-end: markdown--><p>Python:</p><p>This Python code is decidedly better here. It’s simple and relatively painless. I wanted to show this comparison to offer the option of displaying a histogram directly in your database versus pulling the data into a pandas DataFrame and then plotting it. Building the histogram in the database just helps me keep my focus while evaluating the data.</p><!--kg-card-begin: markdown--><pre><code class="language-python">plt.hist(agg_power.value_kwh, bins=30)
plt.show()
</code></pre>
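That said, if you ever want the `repeat()`-style text histogram outside the database too, the same trick takes only a few lines of Python. A minimal sketch, using the first few bucket counts from the results table above as input:

```python
# Mimic the SQL repeat() text histogram from precomputed bucket counts
# (the counts are the first rows of the results table above).
counts = [0, 83, 104, 207, 105]
rows = [f"{c:>4} {'■' * c}" for c in counts]
for row in rows:
    print(row)
```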
<!--kg-card-end: markdown--><h2 id="wrap-up"><br>Wrap Up</h2><p>Hopefully, after reading through these various evaluating techniques, you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. Evaluating data directly in the database often saved me time without sacrificing any functionality. If you are looking to save time and effort while evaluating your data for analysis, definitely consider using PostgreSQL and TimescaleDB. </p><p>In my next posts, I will go over techniques to clean and transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting deep-dive data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).</p><p>If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our <a href="https://slack.timescale.com/">community Slack</a>, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).</p><p>If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install TimescaleDB and manage it on your current PostgreSQL instances</a>. We also have a bunch of <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">great tutorials</a> to help get you started.</p><p>Until next time!</p><p><strong>Functionality Glossary </strong></p><!--kg-card-begin: markdown--><ul>
<li><a href="#select"><code>SELECT</code></a></li>
<li><a href="#select"><code>FROM</code></a></li>
<li><a href="#orderby"><code>ORDER BY</code></a></li>
<li><a href="#orderby"><code>DESC</code></a></li>
<li><a href="#orderby"><code>ASC</code></a></li>
<li><a href="#limit"><code>LIMIT</code></a></li>
<li><a href="#orderby"><code>DISTINCT</code></a></li>
<li><a href="#groupby"><code>GROUP BY</code></a></li>
<li><a href="#groupby"><code>SUM()</code></a></li>
<li><a href="#timebucket"><code>time_bucket(&lt;time_interval&gt;, &lt;time_col&gt;)</code></a></li>
<li><a href="#cte">CTEs <code>WITH</code> <code>AS</code></a></li>
<li><a href="#cte"><code>COUNT()</code></a></li>
<li><a href="#stats"><code>approx_percentile()</code></a></li>
<li><a href="#stats"><code>min_val()</code></a></li>
<li><a href="#stats"><code>max_val()</code></a></li>
<li><a href="#stats"><code>mean()</code></a></li>
<li><a href="#stats"><code>num_vals()</code></a></li>
<li><a href="#stats"><code>percentile_agg()</code> [aggregate]</a></li>
<li><a href="#stats"><code>tdigest()</code> [aggregate]</a></li>
<li><a href="#hist"><code>histogram()</code></a></li>
<li><a href="#hist"><code>unnest()</code></a></li>
<li><a href="#hist"><code>generate_series()</code></a></li>
<li><a href="#hist"><code>repeat()</code></a></li>
</ul>
<!--kg-card-end: markdown--><p><br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speeding up data analysis with TimescaleDB and PostgreSQL]]></title>
            <description><![CDATA[Is your data analysis process as fast and efficient as it could be? This four-part blog series will outline common data analysis problems and how TimescaleDB and PostgreSQL fixed them by making data munging tasks within analysis fast, efficient, and easily accessible. ]]></description>
            <link>https://www.tigerdata.com/blog/speeding-up-data-analysis</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/speeding-up-data-analysis</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[General]]></category>
            <dc:creator><![CDATA[Miranda Auhl]]></dc:creator>
            <pubDate>Thu, 09 Sep 2021 15:32:12 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/09/magnify.jpeg">
            </media:content>
            <content:encoded><![CDATA[<p><a href="https://timescale.ghost.io/blog/time-series-data/">Time-series data</a> is everywhere, and it drives decision-making in every industry. Time-series data collectively represents how a system, process, or behavior changes over time. Understanding these changes helps us to solve complex problems across numerous industries, including <a href="https://timescale.ghost.io/blog/blog/simplified-prometheus-monitoring-for-your-entire-organization-with-promscale/">observability</a>, <a href="https://timescale.ghost.io/blog/blog/how-messari-uses-data-to-open-the-cryptoeconomy-to-everyone/">financial services</a>, <a href="https://timescale.ghost.io/blog/blog/how-meter-group-brings-a-data-driven-approach-to-the-cannabis-production-industry/">Internet of Things</a>, and even <a href="https://timescale.ghost.io/blog/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql/">professional football</a>.</p><p>Depending on the type of application they’re building, developers end up collecting millions of rows of time-series data (and sometimes millions of rows of data every day or even every hour!). Making sense of this high-volume, high-fidelity data takes a particular set of data analysis skills that aren’t often exercised as part of the classic developer skillset. To perform time-series analysis that goes beyond basic questions, developers and data analysts need specialized tools, and as <a href="https://db-engines.com/en/ranking_categories">time-series data grows in prominence</a>, the <strong>efficiency</strong> of these tools becomes even more important.</p><p>Often, data analysts’ work can be boiled down to <strong>evaluating</strong>, <strong>cleaning</strong>, <strong>transforming</strong>, and <strong>modeling</strong> data. 
In my experience, I’ve found these actions are necessary for me to gain understanding from data, and I will refer to this as the “data analysis life cycle” throughout this post.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://timescale.ghost.io/blog/content/images/2021/09/data-analysis-lifecycle.jpeg" class="kg-image" alt="Graphic showing the “data analysis lifecycle”, Evaluate -> Clean -> Transform -> Model" loading="lazy" width="1838" height="466" srcset="https://timescale.ghost.io/blog/content/images/size/w600/2021/09/data-analysis-lifecycle.jpeg 600w, https://timescale.ghost.io/blog/content/images/size/w1000/2021/09/data-analysis-lifecycle.jpeg 1000w, https://timescale.ghost.io/blog/content/images/size/w1600/2021/09/data-analysis-lifecycle.jpeg 1600w, https://timescale.ghost.io/blog/content/images/2021/09/data-analysis-lifecycle.jpeg 1838w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Data analysis lifecycle</span></figcaption></figure><p>Excel, R, and Python are arguably some of the most commonly used data analysis tools, and, while they are all fantastic tools, they may not be suited for every job. Speaking from experience, these tools can be especially inefficient for “data munging” at the early stages of the lifecycle; specifically, the <strong>evaluating data</strong>, <strong>cleaning data</strong>, and <strong>transforming data</strong> steps involved in pre-modeling work.</p><p>As I’ve worked with larger and more complex datasets, I’ve come to believe that databases built for specific types of data - such as time-series data - are more effective for data analysis. </p><p>For background, <a href="https://www.timescale.com/products">TimescaleDB</a> is a <em>relational </em>database for time-series data. If your analysis is based on time-series datasets, TimescaleDB can be a great choice not only for its scalability and dependability but also for its relational nature. 
Because TimescaleDB is packaged as an extension to PostgreSQL, you’ll be able to look at your time-series data alongside your relational data and get even more insight. (I recognize that as a Developer Advocate at Timescale, I might be a <em>little </em>biased 😊…)</p><p>In this blog series, I will discuss each of the three data munging steps in the analysis lifecycle in-depth and demonstrate how to use TimescaleDB as a powerful tool for your data analysis.</p><p>In this introductory post, I'll explore a few of the common frustrations that I experienced with popular data analysis tools, and from there, dive into how I’ve used TimescaleDB to help alleviate each of those pain points.</p><p>In future posts we'll look at:</p><ul><li>How TimescaleDB data analysis functionality can replace work commonly performed in Python and pandas</li><li>How TimescaleDB vs. Python and pandas compare (benchmarking a standard data analysis workflow)</li><li>How to use TimescaleDB to conduct an end-to-end, deep-dive data analysis, using real yellow taxi cab data from the <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page">New York City Taxi and Limousine Commission</a> (NYC TLC).</li></ul><p>If you are interested in trying out TimescaleDB and PostgreSQL functionality right away, <a href="https://www.timescale.com/timescale-signup">sign up for a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install and manage it on your instances</a>. (You can also learn more by <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">following one of our many tutorials</a>.)</p><h2 id="common-data-analysis-tools-and-%E2%80%9Cthe-problem%E2%80%9D">Common data analysis tools and “the problem”</h2><p>As we’ve discussed, the three most popular tools used for data analysis are Excel, R, and Python. 
While they are great tools in their own right, they are not optimized to efficiently perform every step in the analysis process. </p><p>In particular, most data scientists (including myself!) struggle with similar issues as the amount of data grows or the same analysis needs to be redone month after month.</p><p>Some of these struggles include:</p><ul><li>Data storage and access: Where is the best place to store and maintain my data for analysis?</li><li>Data size and its influence on the analysis: How can I improve efficiency for data munging tasks, especially as data scales?</li><li>Script storage and accessibility: What can I do to improve data munging script storage and maintenance?</li><li>Easily utilizing new technologies: How could I set up my data analysis toolchain to allow for easy transitions to new technologies?</li></ul><p>So buckle in, keep your arms and legs in the vehicle at all times, and let’s start looking at these problems!</p><h2 id="data-analysis-issue-1-storing-and-accessing-data"><br>Data analysis issue #1: storing and accessing data</h2><p>To do data analysis, you need access to… data.</p>
<!--kg-card-begin: html-->
<center>
    <iframe src="https://giphy.com/embed/rIq6ASPIqo2k0" width="100px" 
            height="100px" style="min-width: 100%" frameBorder="0" class="giphy-embed" position="relative" aria-label="Image of Data from Star Trek smiling"></iframe>
    <p>
        <a href="https://giphy.com/gifs/rIq6ASPIqo2k0" position="center">via GIPHY</a>
    </p>
</center>
<!--kg-card-end: html-->
<p>Managing where that data lives, and how easily you can access it is the preliminary (and often most important) step in the analysis journey. Every time I begin a new data analysis project, this is often where I run into my first dilemma. Regardless of the original data source, I always ask “where is the best place to store and maintain the data as I start working through the data munging process?”</p><p>Although it's becoming more common for data analysts to use databases for storing and querying data, it's still not ubiquitous. Too often, raw data is provided in a stream of CSV files or APIs that produce JSON. While this may be manageable for smaller projects, it can quickly become overwhelming to maintain and difficult to manage from project to project. </p><p>For example, let’s consider how we might use Python as our data analysis tool of choice. </p><p>While using Python for data analysis, I have the option of ingesting data through files/APIs OR a database connection. </p><p>If I used files or APIs for querying data during analysis, I often faced questions like:</p><ul><li>Where are the files located? What happens if the URL or parameters change for an API?</li><li>What happens if duplicate files are made? And what if updates are made to one file, and not the other?</li><li>How do I best share these files with colleagues?</li><li>What happens if multiple files depend on one another?</li><li>How do I prevent incorrect data from being added to the wrong column of a CSV? (ie. a decimal where a string should be)</li><li>What about very large files? What is the ingestion rate for a 10MB, 100MB, 1GB, 1TB sized file?</li></ul><p>After running into these initial problems project after project, I knew there had to be a better solution.<strong> I knew that I needed a single source of truth for my data – and it started to become clear that a specialized SQL database might be my answer!</strong></p><p>Now, let’s consider if I were to connect to TimescaleDB. 
</p><p>By importing my time-series data into TimescaleDB, I can create one source of truth for all of my data. As a result, collaborating with others becomes as simple as sharing access to the database. Any modifications to the data munging process within the database mean that all users have access to the same changes at the same time, as opposed to parsing through CSV files to verify I have the right version. </p><p>Additionally, databases can typically handle much larger data loads than a script written in Python or R. TimescaleDB was built to house, maintain, and query terabytes of data efficiently and cost-effectively (both computationally speaking AND for your wallet). With features like <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates</a> and native <a href="https://www.tigerdata.com/blog/building-columnar-compression-in-a-row-oriented-database" rel="noreferrer">columnar compression</a>, storing and analyzing years of time-series data became efficient while still being easily accessible.</p><p>In short, managing data over time, especially when it comes from different sources, can be a nightmare to maintain and access efficiently. But it doesn’t have to be.</p><h2 id="data-analysis-issue-2-maximizing-analysis-speed-and-computation-efficiency-the-bigger-the-dataset-the-bigger-the-problem">Data analysis issue #2: maximizing analysis speed and computation efficiency (the bigger the dataset, the bigger the problem)</h2><p>Excel, R, and Python are all capable of performing the first three steps of the data analysis “lifecycle”: evaluating, cleaning, and transforming data. However, these technologies are not generally optimized for speed or computational efficiency during the process. 
</p><p>In numerous projects over the years, I’ve found that as the size of my dataset increased, the process of importing, cleaning, and transforming it became more difficult, time-consuming, and, in some cases, impossible. For Python and R, parsing through large amounts of data seemed to take forever, and Excel would simply crash once it hit millions of rows.  </p><p>Things became <em>especially </em>difficult when I needed to create additional tables for things like aggregates or data transformations: some lines of code could take seconds or, in extreme cases, minutes to run depending on the size of the data, the computer I was using, or the complexity of the analysis. </p><p>While seconds or minutes may not <em>seem</em> like a lot, it adds up and amounts to hours or days of lost productivity when you’re performing analysis that needs to be run hundreds or thousands of times a month! </p><p>To illustrate, let’s look at a Python example once again. </p><p>Say I was working with <a href="https://www.kaggle.com/srinuti/residential-power-usage-3years-data-timeseries">this IoT data set taken from Kaggle</a>. The set contains two tables, one specifying energy consumption for a single home in Houston, Texas, and the other documenting weather conditions.</p><p>To run through the analysis with Python, my first steps would be to pull in the data and observe it. </p><p>When using Python to do this, I would run code like this 👇</p><pre><code class="language-python">import psycopg2
import pandas as pd
import configparser


## use config file for database connection information
config = configparser.ConfigParser()
config.read('env.ini')
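## NOTE: env.ini is assumed to look something like this
## (placeholder values, not from the original post):
##   [USERINFO]
##   DB_NAME = mydb
##   HOST = localhost
##   USER = postgres
##   PASS = secret
##   PORT = 5432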

## establish connection
conn = psycopg2.connect(database=config.get('USERINFO', 'DB_NAME'), 
                        host=config.get('USERINFO', 'HOST'), 
                        user=config.get('USERINFO', 'USER'), 
                        password=config.get('USERINFO', 'PASS'), 
                        port=config.get('USERINFO', 'PORT'))

## define the queries for selecting data out of our database                        
query_weather = 'select * from weather'
query_power = 'select * from power_usage'

## create cursor to extract data and place it into a DataFrame
cursor = conn.cursor()
cursor.execute(query_weather)
weather_data = cursor.fetchall()
cursor.execute(query_power)
power_data = cursor.fetchall()
## you will have to manually set the column names for the data frame
weather_df = pd.DataFrame(weather_data, columns=['date','day','temp_max','temp_avg','temp_min','dew_max','dew_avg','dew_min','hum_max','hum_avg','hum_min','wind_max','wind_avg','wind_min','press_max','press_avg','press_min','precipit','day_of_week'])
power_df = pd.DataFrame(power_data, columns=['startdate', 'value_kwh', 'day_of_week', 'notes'])
cursor.close()
conn.close()

print(weather_df.head(20))
print(power_df.head(20))
</code></pre><p>Altogether, this code took 2.718 seconds to run using my <a href="https://www.apple.com/shop/buy-mac/macbook-pro/16-inch-space-gray-2.3ghz-8-core-processor-1tb#">2019 MacBook Pro laptop with 32GB memory</a>. </p><p>But what if I run the equivalent queries directly in the database with SQL?</p><pre><code class="language-sql">select * from weather;
select * from power_usage;</code></pre><table>
<thead>
<tr>
<th>startdate</th>
<th>value_kwh</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06 01:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 02:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 03:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 04:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 05:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 06:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 07:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 08:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 09:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 10:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 11:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 12:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 13:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 14:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 15:00:00</td>
<td>0</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 16:00:00</td>
<td>1</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-06 17:00:00</td>
<td>4</td>
<td>2</td>
<td>weekday</td>
</tr>
</tbody>
</table>
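The timings quoted in this post were taken informally on a laptop. If you want to measure runtimes yourself, a minimal pattern in Python (an illustrative sketch, not the exact harness used for these numbers) is <code>time.perf_counter()</code>:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time any callable, e.g. a function that runs a query
result, elapsed = timed(sum, range(1_000_000))
print(f"took {elapsed:.3f} seconds")
```

The same wrapper works for timing a database round trip: pass in a function that executes the query and fetches the rows.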
<p>This query only took 0.342 seconds to run, almost 8x faster when compared to the Python script. </p><p>This time difference makes a lot of sense when we consider that Python must connect to a database, then run the SQL query, then parse the retrieved data, and then import it into a DataFrame. While almost three seconds is fast, this extra time for processing adds up as the script becomes more complicated and more data munging tasks are added. </p><p>Pulling in the data and observing it is only the beginning of my analysis! What happens when I need to perform a transforming task, like aggregating the data?</p><p>For this dataset, when we look at the <code>power_usage</code> table - as seen above - kWh readings are recorded every hour. If I want to do daily analysis, I have to aggregate the hourly data into “day buckets”.  </p><p>If I used Python for this aggregation, I could use something like 👇</p><pre><code class="language-python"># sum power usage by day, bucket by day
## create column for the day 
day_col = pd.to_datetime(power_df['startdate']).dt.strftime('%Y-%m-%d')
power_df.insert(0, 'date_day', day_col)
agg_power = power_df.groupby('date_day').agg({'value_kwh' : 'sum', 'day_of_week' : 'unique', 'notes' : 'unique' })
print(agg_power)</code></pre><p>...which takes 0.49 seconds to run (this does not include the time for importing our data).</p><p>Alternatively, with the TimescaleDB <a href="https://docs.timescale.com/api/latest/hyperfunctions/time_bucket/"><code>time_bucket()</code></a> function, I could do this aggregation directly in the database using the following query 👇</p><pre><code class="language-sql">select 
	time_bucket(interval '1 day', startdate ) as day,
	sum(value_kwh),
	day_of_week,
	notes
from power_usage pu 
group by day, day_of_week, notes
order by day</code></pre><table>
<thead>
<tr>
<th>day</th>
<th>sum</th>
<th>day_of_week</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016-01-06 00:00:00</td>
<td>27</td>
<td>2</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-07 00:00:00</td>
<td>42</td>
<td>3</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-08 00:00:00</td>
<td>51</td>
<td>4</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-09 00:00:00</td>
<td>50</td>
<td>5</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-01-10 00:00:00</td>
<td>45</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-01-11 00:00:00</td>
<td>22</td>
<td>0</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-01-12 00:00:00</td>
<td>12</td>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-02-06 00:00:00</td>
<td>32</td>
<td>5</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-02-07 00:00:00</td>
<td>62</td>
<td>6</td>
<td>weekend</td>
</tr>
<tr>
<td>2016-02-08 00:00:00</td>
<td>48</td>
<td>0</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-02-09 00:00:00</td>
<td>23</td>
<td>1</td>
<td>weekday</td>
</tr>
<tr>
<td>2016-02-10 00:00:00</td>
<td>24</td>
<td>2</td>
<td>weekday</td>
</tr>
</tbody>
</table>
<p>...which only takes 0.087 seconds and is over 5x faster than the Python script. </p><p>You can start to see a pattern here. </p><p>As mentioned above, <a href="https://docs.timescale.com/timescaledb/latest/overview/core-concepts/#why-use-timescaledb">TimescaleDB was created to efficiently query and store time-series data</a>. But querying data only scratches the surface of the possibilities TimescaleDB and PostgreSQL functionality provides. </p><p>TimescaleDB and PostgreSQL offer a wide range of tools and functionality that can replace the need for additional tools to evaluate, clean, and transform your data. Some of the TimescaleDB functionality includes continuous aggregates, compression, and <a href="https://docs.timescale.com/api/latest/hyperfunctions/">hyperfunctions</a>, all of which allow you to do nearly all data munging tasks directly within the database. </p><p>When I performed the evaluating, cleaning, and transforming steps of my analysis directly within TimescaleDB, I cut out the need to use additional tools - like Excel, R, or Python - for data munging tasks. I could pull cleaned and transformed data, ready for modeling, directly into Excel, R, or Python.</p><h2 id="data-analysis-issue-3-storing-and-maintaining-scripts-for-data-analysis">Data analysis issue #3: storing and maintaining scripts for data analysis</h2><p>Another potential downside of exclusively using Excel, R, or Python for the entire data analysis workflow is that all of the logic for analyzing the data is contained within a script file. Similar to the issues of having many different data sources, maintaining script files can be inconvenient and messy. 
</p><p>Some common issues that I - and many data analysts - run into include:</p><ul><li>Losing files</li><li>Unintentionally creating duplicate files</li><li>Changing or updating some files but not others</li><li>Needing to write and run scripts to access transformed data (see below example)</li><li>Spending time re-running scripts whenever new raw data is added (see below example)</li></ul><p>While you can use a code repository to overcome some of these issues, it will not fix the last two.  </p><p>Let’s consider our Python scenario again. </p><p>Say that I used a Python script exclusively for all my data analysis tasks. What happens if I need to export my transformed data to use in a report on energy consumption in Texas? </p><p>Likely, I would have to add some code within the script to allow for exporting the data and then run the script again to actually export it. Depending on the content of the script and how long it takes to transform the data, this could be pretty inconvenient and inefficient.</p><p>What if I also just got a bunch of new energy usage and weather data? For me to incorporate this new raw data into existing visualizations or reports, I would need to run the script again and make sure that all of my data munging tasks run as expected. </p><p>Database functions, like continuous aggregates and materialized views, can create transformed data that can be stored and queried directly from your database without running a script. Additionally, I can create policies for continuous aggregates to regularly keep this transformed data up-to-date any time raw data is modified. Because of these policies, I wouldn't have to worry about running scripts to re-transform data for use, making access to updated data efficient. 
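As a rough analogy in Python - a toy sketch, not how TimescaleDB actually implements continuous aggregates - a continuous aggregate with a refresh policy behaves like a cached aggregate that is updated as raw rows arrive, so readers always query precomputed results instead of re-running a transform:

```python
from collections import defaultdict
from datetime import datetime

class DailyKwhAggregate:
    """Toy stand-in for a continuous aggregate: a per-day sum that is
    refreshed automatically whenever raw readings are inserted."""

    def __init__(self):
        self.raw = []                    # raw hourly readings
        self.daily = defaultdict(float)  # "materialized" daily sums

    def insert(self, ts: datetime, kwh: float):
        self.raw.append((ts, kwh))
        self.daily[ts.date()] += kwh     # refresh policy: update on write

    def query(self, day):
        return self.daily[day]           # readers hit precomputed data

agg = DailyKwhAggregate()
agg.insert(datetime(2016, 1, 6, 1), 1.0)
agg.insert(datetime(2016, 1, 6, 2), 1.0)
print(agg.query(datetime(2016, 1, 6).date()))  # prints 2.0
```

The point of the analogy: the aggregate stays current as data arrives, and no consumer ever has to re-run the transformation step. 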
With TimescaleDB, many of the data munging tasks in the analysis lifecycle that you would normally do within your scripts can be accomplished using built-in TimescaleDB and PostgreSQL functionality.</p><h2 id="data-analysis-issue-4-easily-utilizing-new-or-additional-technologies">Data analysis issue #4: easily utilizing new or additional technologies</h2><p>Finally, the last step in the data analysis lifecycle: modeling. If I wanted to use a new tool or technology to create a visualization, it was difficult to easily take my transformed data and use it for modeling or visualizations elsewhere.</p><p>Python, R, and Excel are all pretty great for their visualization and modeling capabilities. However, what happens when your company or team wants to adopt a <em>new</em> tool?</p><p>In my experience, this often means either adding on another step to the analysis process, or rediscovering how to perform the evaluating, cleaning, and transforming steps within the new technology.</p><p>For example, in one of my previous jobs, I was asked to convert a portion of my analysis into Power BI for business analytics purposes. Some of the visualizations my stakeholders wanted required me to access transformed data from my Python script. At the time, I had the option to export the data from my Python script or figure out how to transform the data in Power BI directly. Neither option was ideal, and both were guaranteed to take extra time. </p><p>When it comes to adopting new visualization or modeling tools, using a database for evaluating, cleaning, and transforming data can again work in your favor. Most visualization tools - such as <a href="https://grafana.com/">Grafana</a>, <a href="https://www.metabase.com/">Metabase</a>, or <a href="https://powerbi.microsoft.com/">Power BI</a> - allow users to import data from a database directly. 
</p><p>Since I can do most of my data munging tasks within TimescaleDB, adding or switching tools - such as using Power BI for dashboard capabilities - becomes as simple as connecting to my database, pulling in the munged data, and using the new tool for visualizations and modeling.</p><h2 id="wrapping-up">Wrapping up</h2><p>In summary, Excel, R, and Python are all great tools to use for analysis, but may not be the best tools for every job. Case in point: my struggles with time-series data analysis, especially on big datasets. </p><p>With TimescaleDB functionality, you can house your data and perform the evaluating, cleaning, and transforming aspects of data analysis, all directly within your database – and solve a lot of common data analysis woes in the process (which I’ve - hopefully! - demonstrated in this post).</p><p>In the blog posts to come, I’ll explore TimescaleDB and PostgreSQL functionality compared to Python, benchmark TimescaleDB performance vs. Python and pandas for data munging tasks, and conduct a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations). </p><p>If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, <a href="https://slack.timescale.com/">join our <strong>community Slack</strong></a>, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!). </p><p>If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for <a href="https://www.timescale.com/timescale-signup">a free 30-day trial</a> or <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/install-timescaledb/self-hosted/">install TimescaleDB</a> and manage it on your current PostgreSQL instances. 
We also have a bunch of great <a href="https://docs.timescale.com/timescaledb/latest/tutorials/">tutorials</a> to help get you started.</p><p>Until next time!<br></p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hacking NFL Data With PostgreSQL, TimescaleDB, and SQL]]></title>
            <description><![CDATA[Learn how to use time-series data provided by the NFL to uncover insights.]]></description>
            <link>https://www.tigerdata.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql</link>
            <guid isPermaLink="true">https://www.tigerdata.com/blog/hacking-nfl-data-with-postgresql-timescaledb-and-sql</guid>
            <category><![CDATA[PostgreSQL]]></category>
            <category><![CDATA[Time Series Data]]></category>
            <category><![CDATA[Analytics]]></category>
            <dc:creator><![CDATA[Attila Toth]]></dc:creator>
            <pubDate>Tue, 27 Jul 2021 13:32:25 GMT</pubDate>
            <media:content medium="image" href="https://timescale.ghost.io/blog/content/images/2021/07/ameer-basheer-Yzef5dRpwWg-unsplash--1-.jpg">
            </media:content>
            <content:encoded><![CDATA[<p><em>Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics—and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience—all with PostgreSQL, standard SQL, and freely available extensions.</em></p><p>Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-intraday-stocks/">Stock</a>, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/analyze-cryptocurrency-data/">cryptocurrency</a>, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nyc-taxi-cab/">IoT</a>, and <a href="https://docs.timescale.com/timescaledb/latest/tutorials/promscale/">infrastructure metrics</a> data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for <a href="https://twitter.com/hashtag/TimeSeries">#timeseries</a> or <a href="https://twitter.com/hashtag/TimescaleDB">#TimescaleDB</a>, and you're sure to find questions about high-frequency trading or massive-scale observability data with tools like Prometheus.</p><p>You can imagine our excitement, then, when we happened upon the <a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/">NFL Big Data Bowl</a>, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models. </p><p>Did the NFL <strong><em>really</em></strong> give access to 18+ million rows of detailed play data from every regular season NFL game?</p>
<!--kg-card-begin: html-->
<div style="width:100%;height:0;padding-bottom:100%;position:relative;"><iframe src="https://giphy.com/embed/2w3uNmvsIZ09hNzgzo" width="100px" height="100px" style="position:absolute; min-width: 100%; min-height: 100%" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div><p><a href="https://giphy.com/gifs/nfl-colts-indianapolis-2w3uNmvsIZ09hNzgzo">via GIPHY</a></p>
<!--kg-card-end: html-->
<p>For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials). </p><p>Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.  </p><p>So, at first glance, football game metrics might not immediately jump out as anything special. </p><p>But then the NFL did something pretty ambitious and amazing. </p><p>All <a href="https://operations.nfl.com/gameday/technology/nfl-next-gen-stats/">NFL players are equipped with RFID chips</a> that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize.  The NFL started tracking and storing data for every player on the field, for every play, for every game. </p><p>As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.</p><p>This isn’t ordinary data. <a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">This is time-series data</a>. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. 
In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.</p><p>Time-series data comes at you fast, sometimes generating millions of data points per second (<a href="https://timescale.ghost.io/blog/blog/what-the-heck-is-time-series-data-and-why-do-i-need-a-time-series-database-dcf3b1b18563/">read more about time-series data</a>). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a petabyte-scale, relational database for <a href="https://www.tigerdata.com/blog/time-series-introduction" rel="noreferrer">time series</a>. </p><p>We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season. </p><p>Read on for more information about the <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">NFL’s dataset</a> and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.</p><p><strong>If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days. 
The instructions later in this post will take you through how to ingest the data and start using it for analysis.</p><p>If you’re new to time-series data or just have some questions you’d like to ask about the dataset, <a href="https://slack.timescale.com">join our public Slack community</a>, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.</p><h2 id="the-nfl-time-series-dataset">The NFL Time Series Dataset</h2><p><br>Over the last few years, the NFL and Kaggle have collaborated on the <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">NFL Big Data Bowl</a>. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.</p><p>Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass-play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/discussion/217170">last year’s winners</a>.)</p><p>(Keep watching the <a href="https://operations.nfl.com/gameday/analytics/big-data-bowl/">NFL website</a> for more information on the next Big Data Bowl.)</p><h2 id="accessing-the-data">Accessing the Data</h2><p><br>For the purposes of this blog post and accompanying tutorial, we will use <a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview">the sample data provided by the NFL</a>. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. 
Participants in the next season of the contest will have access to new weekly game data.</p><p>This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets. </p><p>If you want to follow along with - or recreate! - the queries we go through below, <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">follow our tutorial</a> to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any <a href="https://console.cloud.timescale.com" rel="noreferrer">Timescale instance</a>.)</p><h2 id="lets-start-exploring-time-series-insights">Let's Start Exploring Time-Series Insights!</h2><p><br>We've provided the steps needed to ingest the dataset into TimescaleDB in the <a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">accompanying tutorial</a>, so we won’t go into that here. </p><p>The NFL dataset includes the following data:</p>
<ul>
<li><strong>Games</strong>: all relevant data about each game of the regular season, including date, teams, time, and location</li>
<li><strong>Players</strong>: information on each player, including what team they play for and their originating college</li>
<li><strong>Plays</strong>: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.</li>
<li><strong>Week [1-17]</strong>: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the <code>tracking</code> table and totals almost 20 million rows of time-series data.)</li>
</ul>
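For intuition about the shape of the tracking data: each row is a timestamped (x, y) sample for one player, and a per-sample distance field like <code>dis</code> is, in spirit, the Euclidean step between consecutive samples. A pure-Python sketch with invented points:

```python
from math import hypot

# Invented (x, y) samples for one player during one play,
# ordered by time (~100 ms apart in the real data)
samples = [(10.0, 20.0), (10.3, 20.4), (10.9, 21.2)]

# Distance of each step = Euclidean distance between consecutive samples
steps = [hypot(x2 - x1, y2 - y1)
         for (x1, y1), (x2, y2) in zip(samples, samples[1:])]
total_distance = sum(steps)
print(round(total_distance, 2))  # 1.5
```

Summing those per-sample steps over a play gives the distance a player covered, which is exactly what the yardage queries below aggregate.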
<p>In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see <a href="https://timescale.ghost.io/blog/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/">our TimescaleDB vs. InfluxDB comparison</a> for reference). </p><p>Because TimescaleDB is PostgreSQL with time-series superpowers, it supports JOINS, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data with weather data for each stadium.</p><p>Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!</p><h2 id="the-power-of-sql">The Power of SQL</h2><p>Year after year, we see SQL listed as one of the most popular languages among developers on the <a href="https://insights.stackoverflow.com/survey/2020#technology-programming-scripting-and-markup-languages-all-respondents">Stack Overflow survey</a>. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.</p><p>It often feels as if SQL is only useful for getting and storing data in applications, and that we need to leave the "heavy lifting" of analysis to more mature tools.</p><p>Not so! SQL can data munge with the best of them! 
Let's look at a first, quick example.</p><h3 id="average-yards-per-position-per-game">Average yards per position, per game</h3><p>For this first example, we'll query the <code>tracking</code> table (the player movement data from all 17 weeks of games) and join to the <code>game</code> table to determine the number of yards per player position, per game.</p>
<p>The results give you a quick overview of how many yards different positions ran throughout each game. You could later compare specific players against these totals to see whether they covered more or fewer yards than is typical for their position.</p><pre><code class="language-SQL">WITH total_position_yards AS (
	SELECT sum(dis) position_yards, POSITION, gameid FROM tracking t 
	GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION;
</code></pre>
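To make the shape of that two-step aggregation concrete - sum distance per position per game, then average those totals - here is a simplified pure-Python mirror of it. The rows are invented, and for brevity it averages across all games per position rather than per game date as the SQL does:

```python
from collections import defaultdict

# (position, gameid, dis) -- invented sample tracking rows
tracking = [
    ("QB", 1, 10.0), ("QB", 1, 5.0),
    ("QB", 2, 8.0),
    ("WR", 1, 20.0), ("WR", 2, 30.0),
]

# Step 1: total yards per (position, gameid) -- the CTE
position_yards = defaultdict(float)
for position, gameid, dis in tracking:
    position_yards[(position, gameid)] += dis

# Step 2: average those per-game totals per position -- the outer SELECT
totals = defaultdict(list)
for (position, _gameid), yards in position_yards.items():
    totals[position].append(yards)

avg_yards = {pos: sum(v) / len(v) for pos, v in totals.items()}
print(avg_yards)  # {'QB': 11.5, 'WR': 25.0}
```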
<h3 id="number-of-plays-by-offensive-player"><br>Number of plays by offensive player</h3><p>As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.</p><p>This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.</p><pre><code class="language-SQL">WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events &amp; filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam
GROUP BY a.player_id, pl.display_name, a.team_name
ORDER BY play_count DESC;
</code></pre>
<table>
<thead>
<tr>
<th>player_id</th>
<th>display_name</th>
<th>play_count</th>
<th>team_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>2506109</td>
<td>Ben Roethlisberger</td>
<td>725</td>
<td>PIT</td>
</tr>
<tr>
<td>2558149</td>
<td>JuJu Smith-Schuster</td>
<td>691</td>
<td>PIT</td>
</tr>
<tr>
<td>2533031</td>
<td>Andrew Luck</td>
<td>683</td>
<td>IND</td>
</tr>
<tr>
<td>2508061</td>
<td>Antonio Brown</td>
<td>679</td>
<td>PIT</td>
</tr>
<tr>
<td>310</td>
<td>Matt Ryan</td>
<td>659</td>
<td>ATL</td>
</tr>
<tr>
<td>2506363</td>
<td>Aaron Rodgers</td>
<td>656</td>
<td>GB</td>
</tr>
<tr>
<td>2505996</td>
<td>Eli Manning</td>
<td>639</td>
<td>NYG</td>
</tr>
<tr>
<td>2543495</td>
<td>Davante Adams</td>
<td>630</td>
<td>GB</td>
</tr>
<tr>
<td>2540158</td>
<td>Zach Ertz</td>
<td>629</td>
<td>PHI</td>
</tr>
<tr>
<td>2532820</td>
<td>Kirk Cousins</td>
<td>621</td>
<td>MIN</td>
</tr>
<tr>
<td>79860</td>
<td>Matthew Stafford</td>
<td>619</td>
<td>DET</td>
</tr>
<tr>
<td>2504211</td>
<td>Tom Brady</td>
<td>613</td>
<td>NE</td>
</tr>
</tbody>
</table>
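The de-duplication step at the heart of that query - count each (game, play) snap exactly once per player - can be sketched in a few lines of pure Python (rows invented; the real tracking table has many rows per play, hence the <code>DISTINCT</code>):

```python
# (player, gameid, playid, event) -- invented sample tracking rows
events = [
    ("P1", 1, 1, "ball_snap"), ("P1", 1, 1, "ball_snap"),  # duplicate rows
    ("P1", 1, 2, "ball_snap"),
    ("P2", 1, 1, "snap_direct"),
]

# Keep only snap events, de-duplicated per (player, game, play)
snaps = {(p, g, pl) for p, g, pl, ev in events
         if ev in ("ball_snap", "snap_direct")}

# Count distinct plays per player
play_count = {}
for player, _g, _pl in snaps:
    play_count[player] = play_count.get(player, 0) + 1
print(play_count)  # P1 -> 2, P2 -> 1 (key order may vary)
```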
<p>If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others. </p><p>Quarterbacks, however, are the most “important” players on the field, and tend to play more than others. By omitting quarterbacks, though, we can get a deeper insight into players across all other positions.</p><pre><code class="language-SQL">WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events &amp; filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position"
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam AND pl."position" != 'QB'
GROUP BY a.player_id, pl.display_name, a.team_name, pl."position"
ORDER BY play_count DESC;
</code></pre>
<p>So, now we can see the non-quarterbacks who are on offense the most in a season:</p><table>
<thead>
<tr>
<th>player_id</th>
<th>display_name</th>
<th>play_count</th>
<th>team_name</th>
<th>position</th>
</tr>
</thead>
<tbody>
<tr>
<td>2558149</td>
<td>JuJu Smith-Schuster</td>
<td>691</td>
<td>PIT</td>
<td>WR</td>
</tr>
<tr>
<td>2508061</td>
<td>Antonio Brown</td>
<td>679</td>
<td>PIT</td>
<td>WR</td>
</tr>
<tr>
<td>2543495</td>
<td>Davante Adams</td>
<td>630</td>
<td>GB</td>
<td>WR</td>
</tr>
<tr>
<td>2540158</td>
<td>Zach Ertz</td>
<td>629</td>
<td>PHI</td>
<td>TE</td>
</tr>
<tr>
<td>2541785</td>
<td>Adam Thielen</td>
<td>612</td>
<td>MIN</td>
<td>WR</td>
</tr>
<tr>
<td>2543468</td>
<td>Mike Evans</td>
<td>610</td>
<td>TB</td>
<td>WR</td>
</tr>
<tr>
<td>2555295</td>
<td>Sterling Shepard</td>
<td>610</td>
<td>NYG</td>
<td>WR</td>
</tr>
<tr>
<td>2540169</td>
<td>Robert Woods</td>
<td>604</td>
<td>LA</td>
<td>WR</td>
</tr>
<tr>
<td>2552600</td>
<td>Nelson Agholor</td>
<td>604</td>
<td>PHI</td>
<td>WR</td>
</tr>
<tr>
<td>2543488</td>
<td>Jarvis Landry</td>
<td>592</td>
<td>CLE</td>
<td>WR</td>
</tr>
<tr>
<td>2540165</td>
<td>DeAndre Hopkins</td>
<td>587</td>
<td>HOU</td>
<td>WR</td>
</tr>
<tr>
<td>2543498</td>
<td>Brandin Cooks</td>
<td>581</td>
<td>LA</td>
<td>WR</td>
</tr>
</tbody>
</table>
<h3 id="sack-percentage-by-quarterback-on-passing-plays">Sack percentage by quarterback on passing plays</h3><p>We can start to go a little deeper by extracting specific data from the <code>tracking</code> table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.</p>
<p>Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (e.g., shotgun formation), or whether sacks occur more often in a specific quarter of the game (perhaps the fourth quarter, because the offensive line is more tired or the team is behind late in games and must pass more often).</p><p>Queries like this can quickly show you quarterbacks who are more likely to get sacked, particularly when they play a strong defensive team.</p><p>To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that, we approached the tracking data by layering Common Table Expressions (CTEs) so that each query could build upon previous results.</p><p>First, we select the distinct list of all plays for each quarterback (<code>qb_plays</code>). We use <code>SELECT DISTINCT</code> because the <code>tracking</code> table holds multiple entries for each player, for each play; we need just one row for each play, for each quarterback.</p>
<p>With this result, we can then count the number of total plays per quarterback (<code>total_qb_plays</code>), the total number of games each quarterback played (<code>qb_games</code>), and finally the number of pass plays the quarterback was part of that resulted in a sack (<code>sacks</code>).</p>
<p>With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.</p><pre><code class="language-SQL">WITH qb_plays AS (
	SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid 
	FROM tracking t 
	WHERE POSITION = 'QB'
),
total_qb_plays AS (
	SELECT count(*) play_count, player_id FROM qb_plays
	GROUP BY player_id
),
qb_games AS (
	SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays 
	GROUP BY player_id
),
sacks AS (
	SELECT count(*) sack_count, player_id 
	FROM play p
	INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid
	WHERE p.passresult = 'S'
	GROUP BY player_id
)
SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp
INNER JOIN qb_games qg ON tqp.player_id = qg.player_id
LEFT JOIN sacks s ON s.player_id = qg.player_id
INNER JOIN player ON tqp.player_id = player.player_id
ORDER BY sack_count DESC NULLS last;
</code></pre>
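<p>To read the <code>sack_percentage</code> column, take the top row of the results below as a worked example: 65 sacks over 579 pass plays gives <code>(65 / 579) * 100 ≈ 11.23</code>.</p>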
<p>If you're an ardent football fan, the results from 2018 probably don't surprise you.</p><table>
<thead>
<tr>
<th>play_count</th>
<th>game_count</th>
<th>sack_count</th>
<th>sack_percentage</th>
<th>display_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>579</td>
<td>16</td>
<td>65</td>
<td>11.23</td>
<td>Deshaun Watson</td>
</tr>
<tr>
<td>602</td>
<td>16</td>
<td>55</td>
<td>9.14</td>
<td>Dak Prescott</td>
</tr>
<tr>
<td>611</td>
<td>16</td>
<td>53</td>
<td>8.67</td>
<td>Derek Carr</td>
</tr>
<tr>
<td>656</td>
<td>16</td>
<td>49</td>
<td>7.47</td>
<td>Aaron Rodgers</td>
</tr>
<tr>
<td>462</td>
<td>15</td>
<td>48</td>
<td>10.39</td>
<td>Russell Wilson</td>
</tr>
<tr>
<td>639</td>
<td>16</td>
<td>47</td>
<td>7.36</td>
<td>Eli Manning</td>
</tr>
<tr>
<td>448</td>
<td>14</td>
<td>45</td>
<td>10.04</td>
<td>Josh Rosen</td>
</tr>
<tr>
<td>659</td>
<td>16</td>
<td>43</td>
<td>6.53</td>
<td>Matt Ryan</td>
</tr>
<tr>
<td>386</td>
<td>14</td>
<td>43</td>
<td>11.14</td>
<td>Marcus Mariota</td>
</tr>
<tr>
<td>619</td>
<td>16</td>
<td>41</td>
<td>6.62</td>
<td>Matthew Stafford</td>
</tr>
<tr>
<td>621</td>
<td>15</td>
<td>38</td>
<td>6.12</td>
<td>Kirk Cousins</td>
</tr>
<tr>
<td>324</td>
<td>11</td>
<td>37</td>
<td>11.42</td>
<td>Ryan Tannehill</td>
</tr>
<tr>
<td>447</td>
<td>11</td>
<td>36</td>
<td>8.05</td>
<td>Carson Wentz</td>
</tr>
</tbody>
</table>
<p>Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.</p><table>
<thead>
<tr>
<th>play_count</th>
<th>game_count</th>
<th>sack_count</th>
<th>sack_percentage</th>
<th>display_name</th>
</tr>
</thead>
<tbody>
<tr>
<td>725</td>
<td>16</td>
<td>25</td>
<td>3.45</td>
<td>Ben Roethlisberger</td>
</tr>
<tr>
<td>682</td>
<td>16</td>
<td>22</td>
<td>3.23</td>
<td>Andrew Luck</td>
</tr>
<tr>
<td>613</td>
<td>16</td>
<td>21</td>
<td>3.43</td>
<td>Tom Brady</td>
</tr>
</tbody>
</table>
<p>Now, let’s try some more “advanced” queries and analyses.</p><h2 id="faster-insights-with-postgresql-and-timescaledb">Faster Insights With PostgreSQL and TimescaleDB</h2><p>So far, the queries we've shown are interesting and help provide insights into various players throughout the season, but if you were looking closely, they're all regular SQL statements. </p><p>A season of NFL tracking data isn't typical time-series data, however. Most of the queries we want to perform need to examine all 20 million rows in some way.</p><p>This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.</p><h2 id="faster-queries-with-timescaledb-continuous-aggregates">Faster Queries With TimescaleDB Continuous Aggregates</h2><p>We noticed that we often needed to build queries that started with the <code>tracking</code> table, filtering data by specific players, positions, and games. Part of the reason is that the <code>play</code> table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the <code>tracking</code> table to identify the players who were involved in any given play.</p>
<p>The first query we demonstrated, “average yards per position, per game,” is a good example of this. That query begins by summing all yards, by position, for each game.</p><p>This means that every row in <code>tracking</code> has to be read and aggregated <em>before</em> we can do any other analysis. Scanning those 20 million rows is repetitive, slow work, especially compared to the analysis we actually want to do!</p>
<p>On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).</p><p>Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our <a href="https://docs.timescale.com/timescaledb/latest/how-to-guides/continuous-aggregates/">continuous aggregates docs</a>.)</p><pre><code class="language-SQL">CREATE MATERIALIZED VIEW player_yards_by_game
WITH (timescaledb.continuous) AS
SELECT player_id, position, gameid,
 time_bucket(INTERVAL '1 day', "time") AS bucket,
 SUM(dis) AS yards
FROM tracking t
GROUP BY player_id, position, gameid, bucket;
</code></pre>
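<p>One detail worth calling out (our note, not part of the original tutorial): the view is populated when it's created, but it won't pick up rows ingested afterward unless it's refreshed. For a static historical dataset like this one, a one-off manual refresh over the full range is enough; for continuously arriving data, you'd attach a refresh policy instead. The offsets below are illustrative, not prescriptive:</p><pre><code class="language-SQL">-- One-off: materialize the whole range (NULL bounds cover all data)
CALL refresh_continuous_aggregate('player_yards_by_game', NULL, NULL);

-- Live data: refresh hourly, covering buckets between 7 days and 1 hour old
SELECT add_continuous_aggregate_policy('player_yards_by_game',
  start_offset =&gt; INTERVAL '7 days',
  end_offset =&gt; INTERVAL '1 hour',
  schedule_interval =&gt; INTERVAL '1 hour');
</code></pre>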
<p>After running this query and creating the continuous aggregate, we can modify that first query just slightly, using the aggregate as our base table.</p><pre><code class="language-SQL">WITH total_position_yards AS (
	SELECT sum(yards) position_yards, POSITION, gameid
	FROM player_yards_by_game t
	GROUP BY POSITION, gameid
)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION
ORDER BY game_date, position;
</code></pre>
<p>We get the same result, but now the query runs in about 100 ms, roughly <strong>80x faster</strong>!</p><h2 id="advanced-sql-data-analysis-with-timescaledb-hyperfunctions">Advanced SQL Data Analysis With TimescaleDB Hyperfunctions</h2><p>Finally, the more we dug into the data, the more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.</p><p>It is for this kind of analysis that we built <a href="https://timescale.ghost.io/blog/blog/introducing-hyperfunctions-new-sql-functions-to-simplify-working-with-time-series-data-in-postgresql/">TimescaleDB hyperfunctions</a>, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.</p><h3 id="grouping-data-into-percentiles">Grouping data into percentiles</h3><p>The NFL dataset is a great use case for <a href="https://docs.timescale.com/api/latest/hyperfunctions/percentile-approximation/">percentiles</a>. Being able to quickly find players who perform better or worse than some cohort is really powerful.</p><p>As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.</p><pre><code class="language-SQL">WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position, gameid
)
--Find the mean and median for each position type
SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY mean_yards DESC;
</code></pre>
<table>
<thead>
<tr>
<th>position</th>
<th>mean_yards</th>
<th>median_yards</th>
</tr>
</thead>
<tbody>
<tr>
<td>FS</td>
<td>595.583433048431</td>
<td>626.388099960848</td>
</tr>
<tr>
<td>CB</td>
<td>572.3336749867212</td>
<td>592.2175990890378</td>
</tr>
<tr>
<td>WR</td>
<td>552.6508570179277</td>
<td>555.5030569048633</td>
</tr>
<tr>
<td>S</td>
<td>530.6436781609186</td>
<td>550.5961518474892</td>
</tr>
<tr>
<td>SS</td>
<td>522.5604103343453</td>
<td>551.1296628916651</td>
</tr>
<tr>
<td>MLB</td>
<td>462.70229007633407</td>
<td>490.77906906009343</td>
</tr>
<tr>
<td>ILB</td>
<td>402.7882871125599</td>
<td>403.3779668359464</td>
</tr>
<tr>
<td>OLB</td>
<td>393.40014271151847</td>
<td>390.6742117791442</td>
</tr>
<tr>
<td>QB</td>
<td>334.7025466893028</td>
<td>352.1192705472368</td>
</tr>
<tr>
<td>LB</td>
<td>328.9812527472519</td>
<td>257.72003396053884</td>
</tr>
<tr>
<td>TE</td>
<td>327.9515596330271</td>
<td>257.72003396053884</td>
</tr>
</tbody>
</table>
<h3 id="finding-extreme-outliers">Finding extreme outliers</h3><p>Finally, we can build upon this percentile query to find players at each position who run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players who are able to cover the field consistently throughout a game and make plays!</p><pre><code class="language-SQL">WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position
),
position_percentile AS (
	SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95
	FROM sum_yards 
	GROUP BY position
)
SELECT a.POSITION, a.display_name, yards, p95
	FROM sum_yards a
	LEFT JOIN position_percentile pp ON a.POSITION = pp.position
	WHERE yards &gt;= p95
AND a.POSITION IN ('WR','FS','QB','TE')
ORDER BY position;
</code></pre>
<table>
<thead>
<tr>
<th>position</th>
<th>display_name</th>
<th>yards</th>
<th>p95</th>
</tr>
</thead>
<tbody>
<tr>
<td>FS</td>
<td>Eric Weddle</td>
<td>13869.759999999997</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>FS</td>
<td>Adrian Amos</td>
<td>12989.439999999966</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>FS</td>
<td>Tyrann Mathieu</td>
<td>12565.219999999956</td>
<td>12320.288323166456</td>
</tr>
<tr>
<td>QB</td>
<td>Aaron Rodgers</td>
<td>7422.35999999995</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>QB</td>
<td>Patrick Mahomes</td>
<td>6985.989999999952</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>QB</td>
<td>Matt Ryan</td>
<td>6759.959999999969</td>
<td>6667.51452813257</td>
</tr>
<tr>
<td>TE</td>
<td>Zach Ertz</td>
<td>13124.58999999995</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Jimmy Graham</td>
<td>12693.679999999982</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Travis Kelce</td>
<td>12218.129999999957</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>David Njoku</td>
<td>11502.159999999965</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>George Kittle</td>
<td>11058.099999999975</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Kyle Rudolph</td>
<td>10761.949999999968</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>TE</td>
<td>Jared Cook</td>
<td>10678.22999999998</td>
<td>10667.986199523099</td>
</tr>
<tr>
<td>WR</td>
<td>Antonio Brown</td>
<td>16877.559999999965</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Brandin Cooks</td>
<td>15510.01999999995</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>JuJu Smith-Schuster</td>
<td>15492.76999999996</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Robert Woods</td>
<td>15253.179999999958</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Nelson Agholor</td>
<td>15180.32999999997</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Tyreek Hill</td>
<td>15106.609999999973</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Zay Jones</td>
<td>14790.589999999967</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Sterling Shepard</td>
<td>14673.79999999996</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Mike Evans</td>
<td>14620.129999999983</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Davante Adams</td>
<td>14574.509999999951</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Kenny Golladay</td>
<td>14354.499999999973</td>
<td>14271.23409723974</td>
</tr>
<tr>
<td>WR</td>
<td>Jarvis Landry</td>
<td>14281.509999999971</td>
<td>14271.23409723974</td>
</tr>
</tbody>
</table>
<h2 id="where-can-the-data-take-you">Where Can the Data Take You?</h2><p>As you’ve seen in this example, <strong>time-series data is everywhere</strong>. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.</p><p>We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspire you to dig into the data yourself.</p><p><strong>To get started with the </strong><a href="https://www.kaggle.com/c/nfl-big-data-bowl-2021/overview"><strong>NFL data</strong></a><strong>:</strong></p><ul><li><strong>Spin up a fully managed TimescaleDB service</strong>: create an account to <a href="https://console.cloud.timescale.com/signup">try it for free</a> for 30 days.</li><li><a href="https://docs.timescale.com/timescaledb/latest/tutorials/nfl-analytics/">Follow our complete tutorial</a> for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from it.</li></ul><p>If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, <a href="https://slack.timescale.com">join our public Slack community</a>. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.</p><p>🙏 We’d like to thank the NFL for making this data available and the millions of passionate fans around the world who make the NFL such an exciting game to watch.</p><p>And, Geaux Saints 🏈!</p>]]></content:encoded>
        </item>
    </channel>
</rss>