Does Yellowfin reserve memory based on metadata?

Dean Flinter shared this question 6 months ago
Answered

Hi,

I'm creating a new table for some Google Analytics data that I'm extracting from BigQuery.

When testing with transformation flows, I saw that the generated table was using the extract to determine the VARCHAR length for some of the columns.

The thing is, I can't really know ahead of time whether certain values will exceed this limit.

I was thinking of using the max length supported by my DWH (Snowflake) so we never have truncated values, but while researching it I found that Snowflake mentions some applications may reserve memory based on the metadata.

I don't want Yellowfin reserving 16 MB of memory for values that may be a fraction of that, as it could quickly consume a lot of the memory available on the server.


Thanks

Dean

Comments (8)

Hi Dean,

Just looking to get a little clarification on what you're asking here before I start asking developers questions.

Are you referring to memory as in RAM or memory as in, for example, the disk space taken up by a field with the assigned max value for a varchar? I'm fairly certain that Yellowfin will reserve the max size of storage required for a varchar of length x, even if the field isn't fully 'occupied' by characters.

Kind regards,

Chris

Hi Chris,

Thanks for getting back to me.

I was thinking more about RAM, but after looking at some other Snowflake posts, it seems both could be applicable:


https://community.snowflake.com/s/article/faq-how-do-i-choose-the-size-of-my-varchar-columns

https://community.snowflake.com/s/question/0D50Z00008C8oNX/why-not-use-varchar-default-length


We don't really save anything to disk, so for us RAM allocation is probably more applicable. I was thinking of finding the max length of my current data, adding something like 20-30% to it, and using that, but that still carries the risk of something exceeding it later. When running reports, does Yellowfin initially pull the data into RAM and then spool it to disk (which eventually gets purged), or is everything done in memory?
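For what it's worth, the "max observed length plus headroom" idea can be sketched roughly like this (a minimal illustration, not Yellowfin or Snowflake code; the sample values and the 30% headroom figure are made up):

```python
def varchar_length_with_headroom(values, headroom=0.3):
    """Return a VARCHAR length covering the longest observed value plus headroom."""
    # Ignore NULLs; fall back to 1 if no data has been seen yet.
    longest = max((len(v) for v in values if v is not None), default=1)
    # Round down after applying headroom, then add 1 so we never land exactly on the edge.
    return int(longest * (1 + headroom)) + 1

# Hypothetical sample of page-path values from the extract:
sample = ["/home", "/products/widgets", "/checkout/confirmation"]
print(varchar_length_with_headroom(sample))  # longest is 22 chars -> 29
```

As noted in the thread, this only mitigates the risk: any value longer than the padded length would still be truncated, so it trades guaranteed safety for a tighter declared width.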


Thanks

Dean

Hi Dean,

So obviously Yellowfin has various caches for reports, report data, views and so on, which store data at various stages to make it quicker to pull up and display to users. However, in your case, I would think that a transformation flow writes all its data back to the database immediately when run. My reasoning for this is that we recommend big ETL jobs be scheduled out of hours to minimise strain on the database. However, I can put this to the dev team if you like and see if I can get you a more definitive answer.

Another way around it would be to test the extract using max length from BigQuery with a scheduled ETL during a period of low activity, and then monitor Yellowfin's resource use at that time. If RAM use suddenly spikes, you have your answer.
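If you want to automate that monitoring, something like the following could log the Yellowfin JVM's resident memory over the test window (a rough sketch assuming a Linux host; the pid and polling interval are hypothetical):

```python
import os
import time

def rss_kib(pid: int) -> int:
    """Resident set size of a process in KiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                # Line looks like: "VmRSS:     123456 kB"
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found")

# Hypothetical usage: poll the Yellowfin JVM's pid during the scheduled run.
# yellowfin_pid = 12345
# for _ in range(60):
#     print(rss_kib(yellowfin_pid))
#     time.sleep(60)

# Demonstrated here on the current process:
print(rss_kib(os.getpid()))
```

A sudden jump in the logged values while the ETL runs would be the spike Chris describes; flat readings would suggest the wide VARCHAR declaration isn't costing RAM.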

Again, thinking about the problem another way: what if Yellowfin read the data and then dynamically set the VARCHAR field length to an appropriate value automatically... although this might be an enhancement with a big dollop of wishful thinking!

Kind regards,

Chris

Thanks Chris

Yeah, the plan is to schedule the flow during the night.

My question, though, was more in relation to running reports based on this extracted data. If this max-length VARCHAR field is included in a report, does Yellowfin assume it will receive the full size and reserve memory up front, or does it grow memory on the fly as the result set is returned?

Sorry for being so pedantic about this! I've been testing it and the report works fine, but I'm worried that as I release it to users, if many are using it simultaneously, we could see RAM usage increase significantly.

Hi Dean,

That's no worries at all! Better to be clear and have clarification on it than not know! I'll put it to the dev team to get a definitive answer.

Kind regards,

Chris

Hi Dean,

I have an answer from our lead dev on this:

The difference between a database CHAR and VARCHAR is that VARCHARs are variable length, so the database will only allocate memory for the length of the data. In some databases VARCHAR(MAX) is mapped to a CLOB type, and a single row can store 2 GB of data... However, it looks like Snowflake's maximum VARCHAR size is about 16 MB.


Once the data is brought into Yellowfin (from a CLOB or VARCHAR), it is stored as a Java String, which only allocates memory for the actual data. So Yellowfin's code should be optimal with respect to memory usage. We can't speak for what happens inside the Snowflake JDBC driver itself, but I would assume it is optimal as well. No pre-allocation of memory happens in Yellowfin; memory is only used as each row is received from the database.
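The key point, that an in-process string is sized by its actual contents rather than the declared column width, can be demonstrated directly (shown here in Python as an analogue of the Java String behaviour described above; the values are made up):

```python
import sys

# A value read from a hypothetical VARCHAR(16777216) column still only
# occupies memory proportional to its actual contents once it is a string.
short_value = "hello"             # a small row value from the wide column
long_value = "x" * 1_000_000      # a genuinely large value from the same column

print(sys.getsizeof(short_value))  # tens of bytes, nowhere near 16 MB
print(sys.getsizeof(long_value))   # roughly one byte per character plus overhead
```

In other words, the 16 MB declared maximum is a ceiling on what a value *may* be, not a buffer that gets reserved per value on the application side.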

I hope that sheds some light on it! So it should be good, provided the Snowflake driver isn't crazy.

Kind regards,

Chris

Thanks Chris!

I'm going to presume the driver behaves in a sane manner!

Thanks for your help


Dean

Hi Dean,

No worries at all :) I'll mark this as answered. Enjoy the rest of your week!

Kind regards,

Chris