Using Trino to query data from Data Warehouse (Part 3)
In the previous article, I introduced a design and installation of a Data Warehouse based on the Hadoop platform and some open-source technologies; you can review it here. That design uses Spark Thrift Server (STS), a JDBC/ODBC server that allows BI applications to connect to the DWH and request execution of SQL queries. However, when launched, STS creates only one Spark Context object, so only one query can be executed at a time. To solve this problem, I use Trino as the query engine; it coordinates resources more sensibly and can run multiple queries at the same time.
Contents
- Overview
- Install Hive Metastore (HMS)
- Install and configure Trino on a node
- Add new node to Trino cluster
- Conclusion
I am currently consulting on, designing, and implementing data analysis infrastructure, Data Warehouses, and Lakehouses for individuals and organizations in need. You can see and try a system I have built here. Please contact me via email: lakechain.nguyen@gmail.com. Thank you!
Overview
Trino is a high-performance distributed SQL query engine for big data analysis. It can work with many different datasources through connectors, which also allows a single query to combine multiple datasources. Trino is neither a Database nor a Data Warehouse: it does not store data, it only provides a SQL interface for users and other applications to query data on different datasources using the SQL language. Trino is built for Online Analytical Processing (OLAP) tasks rather than Online Transaction Processing (OLTP).
Some concepts in Trino:
- Datasource: A data source that Trino reads from; it can be a database, a file system (local, HDFS, S3…), a Google Sheets spreadsheet, Elasticsearch, even Kafka… Trino uses a connector corresponding to each datasource to connect to it.
- Connector: Acts as a translator between Trino and the datasources; you can see the full list of supported connectors here.
- Catalog: A declaration of a Datasource in Trino; a catalog declaration includes the connector name and the configurations Trino needs to use the Datasource. In SQL, the catalog is the first level when referencing a table: <catalog>.<schema>.<table> (see the example right after this list).
- Schema: is a named group of multiple tables in the catalog, it is equivalent to the concept of a database in relational databases.
- Table: is a table containing data, using SQL users can create, change, query or add data to the table.
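For example, with a catalog named hive (declared later in this article) and a schema named default, a table named sales (a placeholder name used only for illustration) is addressed like this:
select * from hive.default.sales;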
Trino is a distributed query engine, meaning it can be installed on multiple nodes that combine into a cluster. The Trino cluster follows a master-slave architecture: a master node manages, coordinates, and schedules the entire cluster, while slave nodes execute the tasks assigned by the master. The components of a Trino cluster are:
- Coordinator: Acts as the master of the Trino cluster; it is the server that receives SQL queries from clients, parses the syntax, plans the execution, assigns tasks to the Workers, then aggregates the Workers' results and returns them to the client.
- Worker: Acts as the slave node in the Trino cluster, it receives and processes tasks assigned by the Coordinator.
Install Hive Metastore (HMS)
Before we start, I will explain briefly why we need HMS. If you remember, in the previous article when I installed the Data Warehouse (you can review it here), I did 2 things:
- One is to create and configure the data storage directory in DWH, a directory on HDFS:
hdfs://node01:9000/user/hive/warehouse
- Two is to create and configure the metadata storage location for DWH, a database in postgresql:
jdbc:postgresql://node01:5432/metastore
These are the 2 components that make up a Data Warehouse: Storage and Metadata. Storage is where the data itself is kept, while Metadata stores information such as schemas (databases), tables, table structures, table types, and the locations of each table's data in Storage… With just these two components, a query engine can query data in the DWH. In the previous article, I used Spark Thrift Server as the query engine server, but it is also possible to work with the DWH using Spark SQL (see more here) or inside a Spark job (see more here).
To query data in DWH, Trino needs a service that provides information about Storage and Metadata, this is what HMS will do.
That’s the theory; now let’s start installing. I will do it on the DWH cluster from the previous article, which you can review here. First, go to the Hive homepage to download the appropriate Hive version here. I use version 2.3.9 because this is the version my Spark Thrift Server is using.
Note: To know which Hive version STS is using, run the STS job, then go to the job’s SparkUI interface, go to the Environment tab and find the configuration
spark.sql.hive.version
$ wget https://dlcdn.apache.org/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
$ tar -xvzf apache-hive-2.3.9-bin.tar.gz
$ mv apache-hive-2.3.9-bin /lib/hive
$ mkdir /lib/hive/logs
$ chgrp hadoop -R /lib/hive
$ chmod g+w -R /lib/hive
Add environment variables to /etc/bash.bashrc
export HIVE_HOME=/lib/hive
export PATH=$HIVE_HOME/bin:$PATH
Update environment variables
$ source /etc/bash.bashrc
Copy configuration and libraries from Spark
$ cp $SPARK_HOME/conf/hive-site.xml $HIVE_HOME/conf/
$ cp $SPARK_HOME/jars/postgresql-42.5.1.jar $HIVE_HOME/lib/
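For reference, this hive-site.xml already carries the two pieces of information HMS needs: the metastore database connection and the warehouse location. A minimal sketch, assuming the values from the previous article, looks roughly like this:
<configuration>
  <!-- Metastore database in PostgreSQL (from the previous article) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://node01:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <!-- Warehouse storage location on HDFS -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>hdfs://node01:9000/user/hive/warehouse</value>
  </property>
</configuration>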
Run HMS as the hive user
[hive]$ hive --service metastore &
By default, HMS will run on port 9083; you can change it with the hive.metastore.port configuration in the file $HIVE_HOME/conf/hive-site.xml
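To quickly verify that HMS is up and listening (assuming the default port 9083), you can check from the shell:
$ ss -lnt | grep 9083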
Install and configure Trino on a node
Go to Trino’s homepage here to find the appropriate version for your system. I chose Trino version 389 because this is the latest version that still supports Java 11.
Note: you can update Java to the latest version to use the latest Trino release, following the instructions here.
$ wget https://repo1.maven.org/maven2/io/trino/trino-server/389/trino-server-389.tar.gz
$ tar -xvzf trino-server-389.tar.gz
$ mv trino-server-389 /lib/trino
$ mkdir /lib/trino/logs
$ chgrp hadoop -R /lib/trino
$ chmod g+w -R /lib/trino
Add environment variables to /etc/bash.bashrc
export TRINO_HOME=/lib/trino
export PATH=$TRINO_HOME/bin:$PATH
Update environment variables
$ source /etc/bash.bashrc
Create the configuration files in turn as follows:
$TRINO_HOME/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://node01:8080
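Besides these required properties, Trino also lets you cap how much memory a query may use. The values below are optional and only an example sized for the 16G heap configured in jvm.config; adjust them to your nodes:
query.max-memory=8GB
query.max-memory-per-node=4GB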
$TRINO_HOME/etc/jvm.config
-server
-Xmx16G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics
-Dfile.encoding=UTF-8
# Disable Preventive GC for performance reasons (JDK-8293861)
#-XX:-G1UsePreventiveGC
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:GCLockerRetryAllocationCount=32
$TRINO_HOME/etc/log.properties
io.trino=INFO
$TRINO_HOME/etc/node.properties
node.environment=production
node.id=node01
node.data-dir=/home/trino/data
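Note: the directory in node.data-dir must exist and be writable by the user that runs Trino. With the trino user created below (whose home directory is /home/trino), you can prepare it like this:
$ mkdir -p /home/trino/data
$ chown -R trino:hadoop /home/trino/data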
$TRINO_HOME/etc/catalog/hive.properties
connector.name=delta-lake
hive.metastore.uri=thrift://node01:9083
hive.metastore-cache.cache-partitions=false
Note: Here I use the delta-lake connector so that Trino can work with Delta tables
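Each additional datasource is simply another catalog file in the same directory. For example, to let Trino also query the PostgreSQL server that hosts the metastore, you could add a catalog like the one below; the connection values are illustrative and this catalog is entirely optional:
$TRINO_HOME/etc/catalog/postgres.properties
connector.name=postgresql
connection-url=jdbc:postgresql://node01:5432/metastore
connection-user=postgres
connection-password=<your-password>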
Create a python symlink (the Trino launcher script requires the python command)
$ ln -s /usr/bin/python3 /usr/bin/python
Create user trino
$ useradd -g hadoop -m -s /bin/bash trino
Run Trino as user trino
[trino]$ launcher start
Check Trino status
[trino]$ launcher status
Check TrinoUI: http://node01:8080/ui/
Install Trino CLI
$ wget https://repo1.maven.org/maven2/io/trino/trino-cli/389/trino-cli-389-executable.jar
$ mv trino-cli-389-executable.jar $TRINO_HOME/bin/trino
$ chmod +x $TRINO_HOME/bin/trino
$ trino http://node01:8080/hive
Run some commands on Trino CLI
trino> show catalogs;
trino> show schemas;
trino> select * from ...
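For example, with the hive catalog configured above you can explore the warehouse like this (the schema and table names here are placeholders):
trino> show schemas from hive;
trino> show tables from hive.default;
trino> select * from hive.default.sales limit 10;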
Stop Trino
[trino]$ launcher stop
Add new node to Trino cluster
To add a new node to an existing Trino cluster, repeat the installation steps above and edit the following 2 configuration files:
$TRINO_HOME/etc/config.properties
coordinator=false
http-server.http.port=8080
discovery.uri=http://node01:8080
$TRINO_HOME/etc/node.properties
node.environment=production
node.id=node02
node.data-dir=/home/trino/data
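After starting the launcher on the new node, you can confirm that it has joined the cluster in the Trino UI, or from the Trino CLI via the system catalog:
trino> select * from system.runtime.nodes;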
Conclusion
In this article, I have shown you how to install Trino as a query engine server to replace Spark Thrift Server. I wish you success!