Episode Details
R or T-SQL? One Button Changes Everything
Published 5 months, 1 week ago
Description
Here’s a story: a team trained a model, everything worked fine—until the dataset doubled. Suddenly, their R pipeline crawled for hours. The root cause wasn’t the algorithm at all. It was compute context. They were running in local compute, dragging every row across the network into memory. One switch to SQL compute context pushed the R script to run directly on the server, kept the data in place, and turned the crawl into a sprint.

That’s the rule of thumb: if your dataset is large, prefer SQL compute context to avoid moving rows over the network. Try it yourself—run the same R script locally and then in SQL compute. Compare wall-clock time and watch your network traffic. You’ll see the difference. And once you understand that setting, the next question becomes obvious: where’s the real drag hiding when the data starts to flow?

The Invisible Bottleneck

What most people don’t notice at first is a hidden drag inside their workflow: the invisible bottleneck. It isn’t a bug in your model or a quirk in your code—it’s the way your compute context decides where the work happens. When you run in local compute context, R runs on your laptop. Every row from SQL Server has to travel across the network and squeeze through your machine’s memory. That transfer alone can strangle performance. Switch to SQL Server compute context, and the script executes inside the server itself, right next to the data. No shuffling rows across the wire, no bandwidth penalty—processing stays local to the engine built to handle it.

A lot of people miss this because small test sets don’t show the pain. Ten thousand rows? Your laptop shrugs. Ten million rows? Now you’re lugging a library home page by page, wondering why the clock melted. The fix isn’t complex tuning or endless loop rewrites. It’s setting the compute context properly so the heavy lifting happens on the server that was designed for it. That doesn’t mean compute context is a magic cure-all.
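The local-versus-server experiment described above can be sketched in RevoScaleR, the R package behind SQL Server R Services. This is a minimal sketch, not a definitive implementation: the connection string, the `dbo.Flights` table, and its column names are placeholder assumptions, and the script needs a reachable SQL Server instance with R Services enabled to run.

```r
# Minimal sketch: same model, two compute contexts.
# Assumptions: RevoScaleR is installed, and a hypothetical dbo.Flights
# table with ArrDelay/DepDelay/Distance columns exists on the server.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myServer;Database=myDb;Trusted_Connection=yes"

# Data source pointing at rows that already live inside SQL Server.
flights <- RxSqlServerData(table = "dbo.Flights", connectionString = connStr)

# 1) Local compute context: every row crosses the network into this machine.
rxSetComputeContext(RxLocalSeq())
t_local <- system.time(
  rxLinMod(ArrDelay ~ DepDelay + Distance, data = flights)
)

# 2) SQL Server compute context: the same script executes inside the
#    server, next to the data -- no rows cross the wire.
rxSetComputeContext(RxInSqlServer(connectionString = connStr))
t_sql <- system.time(
  rxLinMod(ArrDelay ~ DepDelay + Distance, data = flights)
)

# Compare wall-clock time, as suggested above.
print(rbind(local = t_local, sql = t_sql))
```

On a small table the two timings will look similar; the gap opens as the row count grows, because only the local run pays the network transfer per row.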
If your data sources live outside SQL Server, you’ll still need to plan ETL to bring them in first. SQL compute context only removes the transfer tax if the data is already inside SQL Server. Think of it this way: the server’s a fortress smithy; if you want the blacksmith to forge your weapon fast, you bring the ore to him rather than hauling each strike back and forth across town.

This is why so many hours get wasted on what looks like “optimization.” Teams adjust algorithms, rework pipeline logic, and tweak parameters trying to speed things up. But if the rows themselves are making round trips over the network, no amount of clever code will win. You’re simply locked into bandwidth drag. Change the compute context, and the fight shifts in your favor before you even sharpen the code.

Still, it’s worth remembering: not every crawl is caused by compute context. If performance stalls, check three things in order. First, confirm compute context—local versus SQL Server. Second, inspect your query shape—are you pulling the right columns and rows, or everything under the sun? Third, look at batch size, because how many rows you feed into R at a time can make or break throughput. That checklist saves you from wasting cycles on the wrong fix.

Notice the theme: network trips are the real tax collector here. With local compute, you pay tolls on every row. With SQL compute, the toll booths vanish. And once you start running analysis where the data actually resides, your pipeline feels like it finally got unstuck from molasses. But even with the right compute context, another dial lurks in the pipeline—how the rows are chunked and handed off. Leave that setting on default, and you can still find yourself feeding a beast one mouse at a time. That’s where the next performance lever comes in.

Batch Size: Potion of Speed or Slowness

Batch size is the next lever, and it behaves like a potion: dose it right and you gain speed, misjudge it and you stagger.
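In RevoScaleR terms, dosing that potion means setting the chunk size on the data source. The sketch below assumes a hypothetical `dbo.Flights` table and connection string, and the value shown is illustrative, not a recommendation—tune it against your server’s memory budget.

```r
# Minimal sketch: controlling how many rows are handed to R per chunk.
# Assumptions: RevoScaleR installed; dbo.Flights and the connection
# string are placeholders.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myServer;Database=myDb;Trusted_Connection=yes"

flights <- RxSqlServerData(
  table            = "dbo.Flights",
  connectionString = connStr,
  rowsPerRead      = 500000  # illustrative: too small means per-chunk
                             # overhead on millions of tiny reads; too
                             # large risks exhausting memory and paging
)

# Any chunking rx* function then processes the table chunk by chunk.
rxSummary(~ ArrDelay, data = flights)
```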
In SQL Server, the batch size is controlled by the `rowsPerRead` parameter. By d