Saturday, August 6, 2016

MySQLdb/SQLAlchemy Python Streaming

  When you are dealing with a huge query result from a MySQL server (let's say, about 30k rows), your program may consume a lot of memory just to collect the result. In such a case, you probably want to retrieve the result by streaming. This article shows how to enable streaming with Python.

  If you are using MySQLdb, the drill is simple. You just instantiate your cursor as an SSCursor, and boom, it serves as a server-side streaming cursor. The following snippet is a demonstration.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host=host, user=user, passwd=password, db=db)
# SSCursor keeps the result set on the server and streams rows on demand.
cursor = MySQLdb.cursors.SSCursor(conn)
query = "SELECT * FROM big_table;"
cursor.execute(query)
# rowcount wouldn't work here
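  Since an SSCursor pulls rows from the server as you go, you can simply iterate over it. The process() call below is only a placeholder for whatever you actually do with each row.
for row in cursor:
  # each row is fetched from the server as the loop advances
  process(row)

cursor.close()
conn.close()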


  For projects using SQLAlchemy, you may have tried conn.execution_options(stream_results=True) to no avail: it still consumes a lot of memory, since the flag doesn't work with the MySQL driver.
  Cheers, Love! The Cavalry's Here! There is another solution: since SQLAlchemy talks to MySQL through MySQLdb, you can inject the SSCursor via connect_args. Here is an example.
import sqlalchemy as sqla
import MySQLdb.cursors

CHUNK_SIZE = 10000

url = "mysql://%s:%s@%s:%d/%s" % (user, password, host, port, db)
query = "SELECT * FROM big_table;"
# connect_args hands the SSCursor class down to MySQLdb, so every
# connection the engine opens uses a server-side streaming cursor.
engine = sqla.create_engine(url, encoding="utf-8",
  connect_args={"cursorclass": MySQLdb.cursors.SSCursor})
result = engine.execute(query)

rows = result.fetchmany(CHUNK_SIZE)
while rows:
  # do whatever it is you do to the data
  rows = result.fetchmany(CHUNK_SIZE)
  Of course, you will want to read the streaming result in chunks, so that the overwhelming number of rows doesn't become a network transfer bottleneck.
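  If you like, you can wrap that fetchmany loop in a small generator so the calling code just iterates over chunks. The stream_chunks helper below is only a sketch of the idea, not something SQLAlchemy provides.
def stream_chunks(result, chunk_size=CHUNK_SIZE):
  # Yield one list of rows at a time until the streamed result is exhausted.
  while True:
    rows = result.fetchmany(chunk_size)
    if not rows:
      return
    yield rows

for chunk in stream_chunks(engine.execute(query)):
  pass  # do whatever it is you do to the data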

  Another thing to mention: with streaming enabled, rowcount doesn't work. Since the result set stays on the server side, the client cannot know how many rows there actually are until it has read them all.
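  If you really need the total, one workaround (again, just a sketch) is to count the rows yourself as they stream by:
total = 0
for chunk in stream_chunks(engine.execute(query)):
  total += len(chunk)
print("streamed %d rows" % total)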
