Saturday, August 13, 2016

Binding with Infamous 0000-00-00 in MySQL via JDBC

  Sometimes you are dealing with a poorly maintained MySQL server. It may contain some dirty data, including the infamous datetime 0000-00-00 00:00:00 or date 0000-00-00. First of all, if you are developing a new service, just don't use the value 0000-00-00 as a date. If you are running an existing service and have the privileges, write a script that maps them to NULL or 1970-01-01. Zero dates cause a lot of problems, some of which may not be solvable with this post.
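  If you take the cleanup route, the mapping itself is trivial. Below is a minimal sketch; the table and column names in the UPDATE statement are hypothetical, and normalize_zero_date only handles string values as they come off the wire.

```python
def normalize_zero_date(value):
    """Map MySQL zero date/datetime strings to None; pass everything else through."""
    if isinstance(value, str) and value.startswith("0000-00-00"):
        return None
    return value

# The same idea as a one-off cleanup statement (hypothetical table/column names):
CLEANUP_SQL = (
    "UPDATE orders SET shipped_at = NULL "
    "WHERE shipped_at = '0000-00-00 00:00:00';"
)
```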

  Back to the topic: when you read a 0000-00-00 from MySQL as a java.sql.Date, the driver raises an error: '0000-00-00' can not be represented as java.sql.Date.
  The JDBC driver provides a connection property to solve this problem, which is
zeroDateTimeBehavior=convertToNull

  When connecting to the MySQL server, you can append this property to the JDBC URL, like jdbc:mysql://host:port/db?zeroDateTimeBehavior=convertToNull. With this property set, the driver returns NULL for your zero dates/datetimes instead of raising an error.

  There is some space left, so let me talk about the problems you are going to face if you keep using zero dates.
  The first is the cause of this post: reading the data programmatically. Zero dates are invalid in virtually any date format, so most dataframe frameworks will raise an error like the one above.
  Second, migrating the database. If you want to migrate to PostgreSQL (or any other popular database), this is an inevitable problem, since PostgreSQL does not allow zero dates (and most databases do not). My solution is creating a view for each table that maps the zero dates to NULL or 1970-01-01, then exporting the views to the new database.
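  Here is a sketch of that view trick, assuming you can list each table's date columns yourself. NULLIF turns the zero date into NULL; for datetime columns you would compare against '0000-00-00 00:00:00' instead. The function name and the _export suffix are my own choices.

```python
def zero_date_view_sql(table, columns, date_columns):
    """Build a CREATE VIEW statement mapping zero dates to NULL.

    `columns` is every column of `table`, in order; columns that also
    appear in `date_columns` get wrapped in NULLIF.
    """
    parts = []
    for col in columns:
        if col in date_columns:
            parts.append("NULLIF(%s, '0000-00-00') AS %s" % (col, col))
        else:
            parts.append(col)
    return "CREATE VIEW %s_export AS SELECT %s FROM %s;" % (
        table, ", ".join(parts), table)
```

  Exporting from the *_export views instead of the raw tables keeps the target database happy.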
  Third, connecting data visualization or analysis software. Of course, most commercial software deals with the zero-date problem. The real problem is that a date of 0000-00-00 doesn't make any sense to the marketing or decision team. You would have to filter them out or transform them into an explainable report anyway, so why not do it when designing the table schema?

Reference:
  zeroDateTimeBehavior

Saturday, August 6, 2016

MySQLdb/SQLAlchemy Python Streaming

  When you are dealing with a huge query result from a SQL server (let's say, about 30k rows), your program may consume a lot of memory collecting the result. In such a case, you probably want to retrieve the result by streaming. This article shows how to enable the streaming feature in Python.

  If you are using MySQLdb, the drill is simple. You just instantiate your cursor as an SSCursor, and boom, this cursor serves as a server-side streaming cursor. The following snippet is a demonstration.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host=host, user=user, passwd=password, db=db)
cursor = MySQLdb.cursors.SSCursor(conn)
query = "SELECT * FROM big_table;"
cursor.execute(query)
# rowcount wouldn't work here; iterate over the cursor instead
for row in cursor:
    process(row)  # process() is a hypothetical per-row handler


  For those projects using SQLAlchemy, you may have tried conn.execution_options(stream_results=True), fruitlessly: it still consumes a lot of memory, since the flag doesn't work with MySQL.
  Cheers, Love! The Cavalry's Here! There is another solution: since SQLAlchemy's MySQL dialect is based on MySQLdb, it actually provides a way to inject the SSCursor. Here is an example.
import MySQLdb.cursors
import sqlalchemy

CHUNK_SIZE = 10000

url = "mysql://%s:%s@%s:%d/%s" % (user, password, host, port, db)
query = "SELECT * FROM big_table;"
engine = sqlalchemy.create_engine(url, encoding="utf-8",
  connect_args={"cursorclass": MySQLdb.cursors.SSCursor})
cursor = engine.execute(query)

rows = cursor.fetchmany(CHUNK_SIZE)
while rows:
  # do whatever it is you do to the data
  rows = cursor.fetchmany(CHUNK_SIZE)
  Of course, you would like to read the streaming result in chunks, so the overwhelming number of rows wouldn't cause a network-transfer bottleneck.
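  The fetchmany loop generalizes to a small generator that works with any DB-API cursor; the name iter_chunks is my own.

```python
def iter_chunks(cursor, chunk_size=10000):
    """Yield lists of up to chunk_size rows until the cursor runs dry."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows
```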

  Another thing to mention: while the streaming feature is enabled, rowcount wouldn't work. Since the results stay on the server side, there is no way to know how many rows there actually are.