Using Trino to query data from Data Warehouse (Part 3)

- 8 mins

In the previous article, I introduced a Data Warehouse design and installation based on the Hadoop platform and some open-source technologies; you can review it here. That design uses Spark Thrift Server (STS), a JDBC/ODBC server that allows BI applications to connect to the DWH and execute SQL queries. However, when launched, STS creates only one Spark Context, so only one query can run at a time. To solve this problem, I use Trino as the query engine: it coordinates resources more effectively and can run multiple queries concurrently.

Contents

  1. Overview
  2. Install Hive Metastore (HMS)
  3. Install and configure Trino on a node
  4. Add new node to Trino cluster
  5. Conclusion

:pray: I am currently consulting on, designing, and implementing data analysis infrastructure, Data Warehouses, and Lakehouses for individuals and organizations in need. You can see and try a system I have built here. Please contact me via email: lakechain.nguyen@gmail.com. Thank you!

Overview

Trino is a high-performance distributed SQL query engine for big data analytics. It can work with many different data sources through connectors, which also allows a single query to combine multiple data sources. Trino is neither a database nor a data warehouse: it does not store data itself, it only provides a SQL interface for users and other applications to query data in the underlying data sources. Trino is built for Online Analytical Processing (OLAP) workloads rather than Online Transaction Processing (OLTP).
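For example, once Trino has two catalogs configured, a single query can join data across them. The catalog, schema, and table names below are purely illustrative and are not part of this installation:

trino> select o.order_id, c.customer_name
    -> from hive.sales.orders o
    -> join postgresql.crm.customers c on o.customer_id = c.id;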

Some concepts in Trino:

Trino is a distributed query engine, meaning it can be installed on multiple nodes that are combined into a cluster. A Trino cluster follows the master-slave architecture: a master node manages, coordinates, and schedules work for the entire cluster, while slave nodes execute the tasks assigned by the master. The components of a Trino cluster are the Coordinator (the master node) and the Workers (the slave nodes).

Install Hive Metastore (HMS)

Before we start, I will explain a little about why we need HMS. If you remember, in the previous article where I installed the Data Warehouse (you can review it here), I did two things: set up the data storage and set up the metadata store.

These are the two components that make up a Data Warehouse: Storage and Metadata. Storage is where the data itself is kept, while Metadata stores information such as schemas (databases), tables, table structures, table types, and the locations of table data in Storage. With just these two components, a query engine can query data in the DWH. In the previous article, I used Spark Thrift Server as the query engine server, but it is also possible to work with the DWH using Spark SQL (see more here) or inside a Spark job (see more here).

To query data in the DWH, Trino needs a service that provides this Storage and Metadata information, and that is exactly what HMS does.

That’s the theory; now let’s start installing. I will do it on the DWH cluster from the previous article (you can review it here). First, go to the Hive download page here and pick the appropriate Hive version. I use version 2.3.9 because this is the version my Spark Thrift Server is using.

Note: To know which Hive version STS is using, run the STS job, open the job’s Spark UI, go to the Environment tab, and look for the configuration spark.sql.hive.version

$ wget https://dlcdn.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
$ tar -xvzf apache-hive-2.3.9-bin.tar.gz
$ mv apache-hive-2.3.9-bin /lib/hive
$ mkdir /lib/hive/logs
$ chgrp hadoop -R /lib/hive
$ chmod g+w -R /lib/hive

Add the environment variables to /etc/bash.bashrc

export HIVE_HOME=/lib/hive
export PATH=$HIVE_HOME/bin:$PATH

Update environment variables

$ source /etc/bash.bashrc
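At this point the hive command should be available on the PATH; a quick sanity check (this only prints the version, it does not start anything):

$ hive --version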

Copy configuration and libraries from Spark

$ cp $SPARK_HOME/conf/hive-site.xml $HIVE_HOME/conf/
$ cp $SPARK_HOME/jars/postgresql-42.5.1.jar $HIVE_HOME/lib/
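For reference, HMS reads its metastore database connection from the copied hive-site.xml. It should already contain properties along these lines; the values below are only placeholders, yours come from the previous article's setup:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://node01:5432/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.postgresql.Driver</value>
</property>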

Run HMS as the hive user

[hive]$ hive --service metastore &

By default, HMS runs on port 9083. You can change it by setting hive.metastore.port in the file $HIVE_HOME/conf/hive-site.xml
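For example, to move HMS to another port, add a property like the following to hive-site.xml (9084 is only an illustrative value), and you can confirm which port is actually listening with ss:

<property>
  <name>hive.metastore.port</name>
  <value>9084</value>
</property>

$ ss -lntp | grep 9083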

Install and configure Trino on a node

Go to Trino’s download page here to find the appropriate version for your system. I choose Trino version 389 because it is the latest version that still supports Java 11.

Note: you can update Java to the latest version to use the latest Trino release, following the instructions here

$ wget https://repo1.maven.org/maven2/io/trino/trino-server/389/trino-server-389.tar.gz
$ tar -xvzf trino-server-389.tar.gz
$ mv trino-server-389 /lib/trino
$ mkdir /lib/trino/logs
$ chgrp hadoop -R /lib/trino
$ chmod g+w -R /lib/trino

Add the environment variables to /etc/bash.bashrc

export TRINO_HOME=/lib/trino
export PATH=$TRINO_HOME/bin:$PATH

Update environment variables

$ source /etc/bash.bashrc

Create the configuration files in turn as follows:

$TRINO_HOME/etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://node01:8080

$TRINO_HOME/etc/jvm.config

-server
-Xmx16G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics
-Dfile.encoding=UTF-8
# Disable Preventive GC for performance reasons (JDK-8293861)
#-XX:-G1UsePreventiveGC
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:GCLockerRetryAllocationCount=32

$TRINO_HOME/etc/log.properties

io.trino=INFO

$TRINO_HOME/etc/node.properties

node.environment=production
node.id=node01
node.data-dir=/home/trino/data

$TRINO_HOME/etc/catalog/hive.properties

connector.name=delta-lake
hive.metastore.uri=thrift://node01:9083
hive.metastore-cache.cache-partitions=false

Note: Here I use the delta-lake connector so that Trino can work with Delta tables
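Each additional .properties file dropped into $TRINO_HOME/etc/catalog/ becomes another catalog. For example, a hypothetical postgresql.properties like the one below (all connection details are placeholders) would let Trino query a PostgreSQL database alongside the DWH:

connector.name=postgresql
connection-url=jdbc:postgresql://node01:5432/mydb
connection-user=trino
connection-password=secret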

Install Python (the Trino launcher script needs the python command)

$ ln -s /usr/bin/python3 /usr/bin/python

Create user trino

$ useradd -g hadoop -m -s /bin/bash trino

Run Trino as user trino

[trino]$ launcher start

Check Trino status

[trino]$ launcher status
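If the status does not look right, the server log is the first place to check. With node.data-dir=/home/trino/data as configured above, it should be located at:

[trino]$ tail -f /home/trino/data/var/log/server.log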

Check the Trino UI: http://node01:8080/ui/

Trino UI

Install Trino CLI

$ wget https://repo1.maven.org/maven2/io/trino/trino-cli/389/trino-cli-389-executable.jar
$ mv trino-cli-389-executable.jar $TRINO_HOME/bin/trino
$ chmod +x $TRINO_HOME/bin/trino
$ trino http://node01:8080/hive

Run some commands on Trino CLI

trino> show catalogs;
trino> show schemas;
trino> select * from ...
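As a quick smoke test, you can query a table created in the previous articles through the hive catalog. The schema and table names below are only placeholders for whatever already exists in your DWH:

trino> show schemas from hive;
trino> select * from hive.default.my_table limit 10;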

Stop Trino

[trino]$ launcher stop

Add new node to Trino cluster

To add a new node to an existing Trino cluster, repeat the steps above and edit the 2 configuration files:

$TRINO_HOME/etc/config.properties

coordinator=false
http-server.http.port=8080
discovery.uri=http://node01:8080

$TRINO_HOME/etc/node.properties

node.environment=production
node.id=node02
node.data-dir=/home/trino/data
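After starting the launcher on the new node, you can confirm it has joined the cluster either on the Trino UI or from the Trino CLI via the system catalog:

trino> select node_id, http_uri, coordinator from system.runtime.nodes;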

Conclusion

In this article, I have shown you how to install Trino as a query engine server to replace Spark Thrift Server. I wish you success!
