
BigQuery streaming inserts taking time

Nov 18, 2015
<p>Since streaming has a limited payload size (see the <a href="https://cloud.google.com/bigquery/streaming-data-into-bigquery" rel="nofollow">quota policy</a>), it's easier to talk about times, as the payload limit applies equally to both of us, but I will mention the other side effects too.</p> <p>We measure between 1200-2500 ms for each streaming request, and this has been consistent over the last month, as you can see in the chart.</p> <p><img src="http://i.stack.imgur.com/zqNcC.png" alt="chart of streaming request latency over the last month"></p> <p>We have seen several side effects, though:</p> <ul> <li>the request randomly fails with a 'Backend error'</li> <li>the request randomly fails with a 'Connection error'</li> <li>the request randomly fails with a 'timeout' (watch out here, as only some rows may fail, not the whole payload)</li> <li>some other error messages are non-descriptive, and so vague that they don't help you; just retry</li> <li>we see hundreds of such failures each day, so they are pretty much constant and not related to Cloud health</li> </ul> <p>For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option is exponential backoff with retry; even Support told us to do so. Which, personally, doesn't make me happy.</p> <hr> <p>If the approach you've chosen takes hours, that means <code>it does not scale</code>, and it won't scale. You need to rethink the approach and use <code>async processes</code>. To finish sooner, you need to run multiple workers in parallel; the streaming performance per request will be the same, but with 10 workers running in parallel the total time will be 10 times less.</p> <p>Processing IO-bound or CPU-bound tasks in the background is now a common practice in most web applications. 
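<p>The retry strategy above can be sketched as a small helper. This is a minimal sketch, not BigQuery-specific code: <code>insert_fn</code> is a hypothetical callable wrapping your streaming-insert request, assumed to return just the rows that failed (the <code>tabledata.insertAll</code> response reports per-row <code>insertErrors</code>, so only the failed subset needs retrying).</p>

```python
import random
import time

def insert_with_backoff(insert_fn, rows, max_retries=5, base_delay=1.0):
    """Retry a streaming-insert call with exponential backoff.

    insert_fn(rows) must return the list of rows that failed, so each
    retry resends only the rows that were rejected, not the whole payload.
    """
    pending = rows
    for attempt in range(max_retries):
        pending = insert_fn(pending)
        if not pending:
            return []  # every row was accepted
        # exponential backoff with jitter before retrying the failed rows
        delay = base_delay * (2 ** attempt) * (1 + random.random())
        time.sleep(delay)
    return pending  # rows still failing after all retries
```

<p>Returning the leftover rows instead of raising lets the caller decide whether to bury them for inspection or alert.</p>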
There's plenty of software to help build background jobs, some based on a messaging system like <a href="http://kr.github.io/beanstalkd/" rel="nofollow">Beanstalkd</a>.</p> <p>Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. Well, that's exactly what Beanstalkd provides.</p> <p>Beanstalkd makes it possible to organize jobs in tubes, each tube corresponding to a job type.</p> <p>You need an API/producer which can put jobs on a tube, let's say a JSON representation of the row. This was a killer feature for our use case. So we have an API which gets the rows and places them on a tube; this takes just a few milliseconds, so you can achieve fast response times.</p> <p>On the other side, you now have a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.</p> <p>It also helps you with job management and retries: when a job is successfully processed, a consumer can delete the job from the tube. In case of failure, the consumer can bury the job. The job will not be pushed back to the tube, but will be available for further inspection.</p> <p>A consumer can also release a job; Beanstalkd will push the job back into the tube and make it available to another client.</p> <p>Beanstalkd clients are available in most common languages, and a <a href="https://github.com/ptrofimov/beanstalk_console" rel="nofollow">web interface</a> can be useful for debugging.</p> <p>This tip was originally posted on <a href="http://stackoverflow.com/questions/27009118/Bigquery%20streaming%20inserts%20taking%20time/27011553">Stack Overflow</a>.</p>
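<p>The producer/consumer flow described above can be sketched in Python. This is a minimal sketch, not a full worker: <code>conn</code> stands for a Beanstalkd client connection (e.g. <code>beanstalkc.Connection('localhost', 11300)</code> with a running beanstalkd server), the tube name <code>bq-inserts</code> is hypothetical, and <code>insert_fn</code> is whatever function performs your streaming insert.</p>

```python
import json

TUBE = 'bq-inserts'  # hypothetical tube name, one tube per job type

def put_row(conn, row):
    """Producer: serialize one row and put it on the tube (a few ms)."""
    conn.use(TUBE)                    # select the tube to write to
    return conn.put(json.dumps(row))  # job body is the JSON row

def work_one(conn, insert_fn):
    """Consumer: reserve a job, stream the row, then delete or bury it."""
    conn.watch(TUBE)           # select the tube to read from
    job = conn.reserve()       # blocks until a job is available
    row = json.loads(job.body)
    try:
        insert_fn(row)         # e.g. a BigQuery streaming insert for this row
        job.delete()           # success: remove the job from the tube
    except Exception:
        job.bury()             # failure: set the job aside for inspection
```

<p>Running ten processes that loop over <code>work_one</code> is what gives the tenfold speedup mentioned earlier: the API stays fast because it only enqueues, while the slow streaming requests happen in parallel in the background.</p>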