The task of ingesting data into Postgres is a common one in my job as data engineer, and also in my side projects. As such, I learned a few tricks that I'm going to discuss here, in particular related to ingesting data from Python and merging it with existing rows.

Before starting, I have to say that the fastest way to insert data into a Postgres DB is the COPY command, which has a counterpart \copy in the psql CLI tool that is useful to invoke it remotely. The copy command has some options, but in general it's used to perform an initial load from a CSV file. This however becomes harder to do when the data has a more complex layout, for example when it comes from an API or from JSON files and needs to be merged with data that is already loaded into the DB according to some rule.

In this article I'll show a real case and discuss some aspects like:

- the ON CONFLICT functionality (aka UPSERT) to integrate with existing data
- setting up a Postgres instance in an automated and reproducible way to make benchmarking a breeze
- using different libraries: asyncpg, psycopg2 and psycopg3
- using prepared statements and executemany
- performing batch insertions in single SQL queries using execute_values
- using the copy function programmatically, including the binary mode

The case: ingest 5 million comments from Reddit

My use case here is to ingest into Postgres a corpus of 5 million subreddit comments. They were downloaded using the Reddit API and this subreddit downloader script. I want to later use this data to perform NLP tasks, but for the scope of this topic let's just say I have a bunch of JSONL files describing comments (and submissions) in this form:

- id: a unique id of the comment (used as parent_id by possible replies)
- parent_id: the comment or submission this comment is replying to
- created_at: a timestamp with a time zone
- score: the number of positive or negative votes the comment received
- retrieved_at: the timestamp of when the data was retrieved

The retrieved_at field is essential: since the script takes many hours to download this data it usually runs in sessions, and the score changes as people vote on the comments, so we are going to see the same comment in different files at different observation times, and we must update the database only when the retrieved_at of an observation is greater than the one already stored.

While it's possible to do multiple queries to check the data already in the DB and decide what to do for each input entry, that approach is slow and verbose. Luckily, many databases have a function called upsert (update + insert) to perform this operation in one shot. Postgres calls it ON CONFLICT, and it's a clause to execute in case a query raises a constraint error:

    INSERT INTO comments AS old (id, author, parent_id, created_at, score, retrieved_at)
    VALUES (%s, %s, %s, %s, %s, %s)
    ON CONFLICT (id) DO UPDATE SET
        author = EXCLUDED.author,
        score = EXCLUDED.score,
        retrieved_at = EXCLUDED.retrieved_at
    WHERE EXCLUDED.retrieved_at >= old.retrieved_at

It starts like a normal insert, but the ON CONFLICT clause specifies what to do when a duplicate on the id column is present. Notice that this clause requires a constraint on the given set of columns to work; in this case id is the primary key, so it has a unique index associated with it. Another possibility is to define your own unique index or use a CHECK expression on that column.

In this query the excluded table is an alias the database uses to refer to the row that we tried to insert, and together with old (an alias defined at the beginning of the query for the target table) it allows us to define a sophisticated logic to merge the data, in this case a check on the retrieved_at field to preserve the most recent comment snapshot. The logic can be even more complex; for example, one can update flashcard ease factors with an upsert query.
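To give a concrete idea of driving this upsert from Python, here is a minimal sketch using psycopg2's execute_values, which expands all the row tuples into a single INSERT statement (one of the approaches listed above). The file name, the comments table, and the connection parameters are assumptions for the sake of the example:

    import json

    import psycopg2
    from psycopg2.extras import execute_values

    # Same upsert as above; execute_values replaces the single %s
    # placeholder after VALUES with all the row tuples.
    UPSERT_QUERY = """
        INSERT INTO comments AS old (id, author, parent_id, created_at, score, retrieved_at)
        VALUES %s
        ON CONFLICT (id) DO UPDATE SET
            author = EXCLUDED.author,
            score = EXCLUDED.score,
            retrieved_at = EXCLUDED.retrieved_at
        WHERE EXCLUDED.retrieved_at >= old.retrieved_at
    """

    def load_file(conn, path):
        # Each line of the JSONL file is one comment object
        with open(path) as f:
            rows = [
                (c["id"], c["author"], c["parent_id"],
                 c["created_at"], c["score"], c["retrieved_at"])
                for c in map(json.loads, f)
            ]
        with conn.cursor() as cur:
            # Rows are sent in chunks of page_size per statement
            execute_values(cur, UPSERT_QUERY, rows, page_size=1000)
        conn.commit()

    conn = psycopg2.connect(
        "host=localhost port=5442 dbname=reddit user=postgres password=mysecretpassword"
    )
    load_file(conn, "comments.jsonl")
    conn.close()

Compared with issuing one INSERT per comment, this sends far fewer statements to the server, which is usually the first big win when ingesting batches.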
Since the goal is to try different approaches to load the data, it's necessary to be able to easily start and reset a database instance for each one, a process that will be repeated multiple times to reduce the noise in the measurements. A Makefile target takes care of it:

    tabula-rasa:
        docker run --name reddit-postgres \
            -v $(shell pwd)/reddit_db:/var/lib/postgresql/data \
            -p 5442:5432 -e POSTGRES_PASSWORD=mysecretpassword -d postgres:13
        until pg_isready -h localhost -p 5442; do echo "Waiting for the DB to be up."; sleep 4; done
        docker exec reddit-postgres sh -c "echo 'CREATE DATABASE reddit'|psql -U postgres"

It starts a dockerized Postgres instance, mounting the data folder on a local volume, then waits with pg_isready for it to be operational, and finally creates the database.
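With a fresh instance up (make tabula-rasa), the loading approaches can be compared one by one. As an illustration of the prepared statements + executemany path, here is a minimal sketch with asyncpg; the connection details match the Makefile above, while the table, the file name, and the assumption that the timestamps arrive as ISO strings are mine:

    import asyncio
    import json
    from datetime import datetime

    import asyncpg

    # Same upsert, with asyncpg's numbered placeholders; executemany
    # prepares the statement once and reuses it for every row.
    UPSERT_QUERY = """
        INSERT INTO comments AS old (id, author, parent_id, created_at, score, retrieved_at)
        VALUES ($1, $2, $3, $4, $5, $6)
        ON CONFLICT (id) DO UPDATE SET
            author = EXCLUDED.author,
            score = EXCLUDED.score,
            retrieved_at = EXCLUDED.retrieved_at
        WHERE EXCLUDED.retrieved_at >= old.retrieved_at
    """

    async def load_file(path):
        conn = await asyncpg.connect(
            host="localhost", port=5442, database="reddit",
            user="postgres", password="mysecretpassword",
        )
        with open(path) as f:
            # asyncpg is strict about types: timestamp columns want
            # datetime objects rather than strings, so they are parsed
            # here (assuming an ISO format in the input files)
            rows = [
                (c["id"], c["author"], c["parent_id"],
                 datetime.fromisoformat(c["created_at"]), c["score"],
                 datetime.fromisoformat(c["retrieved_at"]))
                for c in map(json.loads, f)
            ]
        await conn.executemany(UPSERT_QUERY, rows)
        await conn.close()

    asyncio.run(load_file("comments.jsonl"))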
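Finally, a minimal sketch of the programmatic copy mentioned at the beginning, using psycopg (version 3), whose cursors expose a copy() context manager. Keep in mind that COPY by itself cannot express the ON CONFLICT merge logic, so it fits the initial-load scenario (or a load into a staging table to upsert from); the table, file name, and connection string are again assumptions:

    import json

    import psycopg

    conninfo = "host=localhost port=5442 dbname=reddit user=postgres password=mysecretpassword"

    with psycopg.connect(conninfo) as conn:
        with conn.cursor() as cur:
            # COPY ... FROM STDIN in text format: psycopg serializes each
            # row and the server parses the fields into the column types
            with cur.copy(
                "COPY comments (id, author, parent_id, created_at, score, retrieved_at) FROM STDIN"
            ) as copy:
                with open("comments.jsonl") as f:
                    for line in f:
                        c = json.loads(line)
                        copy.write_row((
                            c["id"], c["author"], c["parent_id"],
                            c["created_at"], c["score"], c["retrieved_at"],
                        ))

The binary mode listed among the aspects above works through the same interface, by adding WITH (FORMAT BINARY) to the COPY statement, at the cost of having to provide values of exactly the right types.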