Advanced tools for improving performance with Redshift et al

Bogdan Kiselitsa shared this idea 1 month ago
Idea Logged

Hi,


Lately we've been doing some performance testing with a migration to Amazon Redshift, and what we've found is that Yellowfin doesn't allow us to get the most out of it performance wise, due to limitations in functionality. I'll demonstrate with some examples.


So the naive way to model and report on a data model such as the following, will produce SQL as follows:

select
	data_table.data_col
from 
	data_table
	join date_to_data on date_to_data.data_id = data_table.data_id
	join date_time on date_time.date_time_key = date_to_data.date_time_key
where
	date_time.start_date between @start_time and @end_time


You'll notice we really want to filter by date here.


However, the problem here is that to maximise the performance with Redshift, we want to first explicitly restrict the range of the lookup on data_table (which is presumed to be big) to the minimum possible. Ideally, the table date_to_data should be pre-filtered as well.This is because unlike with a traditional DBMS, Redshift will avoid nested loops at all costs because it doesn't scale across nodes.


So the optimal query here needs to become something more like the following, which we can achieve with a more sophisticated approach:


with dates as (
	select 
		min(date_time_key) as min_date_time_key, 
		max(date_time_key) as max_date_time_key 
	from date_time 
	where 
		start_date between @start_time and @end_time
)
,ranges as (
	select 
		min(data_id) as min_id,
		max(data_id) as max_id,
	from date_to_data
	where 
		date_time_key between (select min_date_time_key from dates) and (select max_date_time_key from dates)  
)
select 
	data_table.data_col
from data_table
where
        data_id between (select min_id from ranges) and (select max_id from ranges)


However, it seems that with Yellowfin's modelling tools, there's currently no way to achieve this approach - even using custom SQL tables.You could maybe do it with stored procedures, but that leaves you with a host of other limitations compared to plain modelling.


Can the Yellowfin team please advice is there is any plans to add this kind of more sophisticated capability into the platform? If not, I'd like to suggest it.


Regards,

Bogdan.

Comments (3)

photo
1

Hi Bogdan,

Thanks for submitting your Idea. I'm not sure this is something that's in the immediate road map. That being said, I've submitted this request for review by our Development team to determine whether this is something we will consider implementing in future versions. I'll keep this post updated with further information regarding this.

Another option would be to cache the View data into another writable data source in the form of a traditional RDBMS to mitigate the performance implications.

Thanks,

Ryan

photo
1

Thanks for that, Ryan.


Regarding your idea, I guess it could be done in a pinch, but would not be my first choice as it obviates many of the benefits of using something like Redshift in the first place.


I think that more tools in Yellowfin for getting the most out the various RDBMS can only be a good thing, as there have always been limitations in this regard when the rubber meets the road.


Regards,

Bogdan.

photo
1

Hi Bogdan,

Thanks for the response. I've noted as much in our logged task and will keep this post updated with further information as it is available to me.

Regards,

Ryan