Saturday, August 13, 2016

Binding with Infamous 0000-00-00 in MySQL via JDBC

  Sometimes you have to deal with a poorly maintained MySQL server. It may contain dirty data, including the infamous datetime 0000-00-00 00:00:00 or date 0000-00-00. First of all, if you are developing a new service, just don't use 0000-00-00 as a date. If you are running an existing service and have the privileges, write a script that maps these values to NULL or 1970-01-01. Zero dates cause a lot of problems, and some of them cannot be solved by this post.
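
  Such a cleanup script can be tiny. The following is a minimal sketch with MySQLdb; the table and column names (orders, shipped_at) are hypothetical, so adapt them to your schema, and make sure the column is nullable if you map to NULL.
import MySQLdb

# hypothetical table and column; adjust to your own schema
TABLE = "orders"
COLUMN = "shipped_at"

conn = MySQLdb.connect(host="localhost", user="user", passwd="password", db="db")
cursor = conn.cursor()
# map the zero datetimes to NULL; use '1970-01-01 00:00:00' instead if you prefer
cursor.execute(
  "UPDATE %s SET %s = NULL WHERE %s = '0000-00-00 00:00:00'" % (TABLE, COLUMN, COLUMN))
conn.commit()
conn.close()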

  Back to the topic: when you read a 0000-00-00 from MySQL as a java.sql.Date, the driver raises an error, '0000-00-00' can not be represented as java.sql.Date.
  The MySQL JDBC driver provides a configuration option to solve this problem, which is
zeroDateTimeBehavior=convertToNull

  When you connect to the MySQL server, you can specify this option in the connection URL, like mysql://user:pass@host:port/db?zeroDateTimeBehavior=convertToNull. With this option, the driver automatically converts your zero dates/datetimes to NULL when reading.

  There is some space left, so let me talk about the problems you are going to face if you keep using zero dates.
  The first is the cause of this post: reading the data from a program. Zero dates are invalid in virtually every date format, and most dataframe frameworks will raise an error like the one above.
  Second, migrating databases. If you want to move to PostgreSQL (or any other popular database), this becomes an unavoidable problem, since PostgreSQL does not allow zero dates (and most databases do not). My solution is to create a view for each table that maps the zero dates to NULL or 1970-01-01, then export the views to the new database (see the sketch below).
  Third, connecting visualization or analysis software. Most commercial software can deal with zero dates, but the real problem is that a date of 0000-00-00 does not make any sense to the marketing or decision team. You would have to filter or transform them into an explainable report anyway, so why not do it when designing the table schema?
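
  Here is the sketch of the view approach mentioned in the second point above. Again, the table and column names are hypothetical; NULLIF maps the zero datetime to NULL on the fly, so the views can be exported to a database that refuses zero dates.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="password", db="db")
cursor = conn.cursor()
# hypothetical table "orders" with a datetime column "shipped_at"
cursor.execute("""
  CREATE OR REPLACE VIEW orders_export AS
  SELECT id, NULLIF(shipped_at, '0000-00-00 00:00:00') AS shipped_at
  FROM orders
""")
conn.close()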

Reference:
  zeroDateTimeBehavior

Saturday, August 6, 2016

MySQLdb/SQLAlchemy Python Streaming

  When you are dealing with a huge query result from a SQL server (let's say, about 30k rows), your program may consume a lot of memory just to collect the result. In such a case, you probably want to retrieve the result by streaming. This article shows how to enable the streaming feature from Python.

  If you are using MySQLdb, the drill is simple: you just instantiate your cursor as an SSCursor, and boom, the cursor serves you results as a server-side streaming cursor. The following snippet is a demonstration.
import MySQLdb.cursors
import MySQLdb

conn = MySQLdb.connect(host=host, user=user, passwd=password, db=db)
cursor = MySQLdb.cursors.SSCursor(conn)  # server-side (streaming) cursor
query = "SELECT * FROM big_table;"
cursor.execute(query)
# rowcount wouldn't work here
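
  The snippet above only issues the query; to actually consume the streaming result, a minimal sketch (fetching in chunks, the same pattern as the SQLAlchemy example below) looks like this:
CHUNK_SIZE = 10000

rows = cursor.fetchmany(CHUNK_SIZE)
while len(rows) > 0:
  # do whatever it is you do to the data
  rows = cursor.fetchmany(CHUNK_SIZE)

cursor.close()
conn.close()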


  For projects using SQLAlchemy, you may have tried conn.execution_options(stream_results=True), fruitlessly: it still consumes a lot of memory, because that flag does not work with the MySQL driver.
  Cheers, Love! The Cavalry's Here! There is another solution: since SQLAlchemy uses MySQLdb as its MySQL driver, it actually provides a way to inject the SSCursor. Here is an example.
import sqlalchemy
import MySQLdb.cursors
CHUNK_SIZE = 10000

url = "mysql://%s:%s@%s:%d/%s" % (user, password, host, port, db)
query = "SELECT * FROM big_table;"
engine = sqlalchemy.create_engine(url, encoding="utf-8",
  connect_args={"cursorclass": MySQLdb.cursors.SSCursor})
cursor = engine.execute(query)

rows = cursor.fetchmany(CHUNK_SIZE)
while len(rows) > 0:
  # do whatever it is you do to the data
  rows = cursor.fetchmany(CHUNK_SIZE)
  Of course, you would like to read the streaming result in chunks, so the overwhelming number of rows doesn't become a network transfer bottleneck.

  One more thing to mention: with the streaming feature enabled, rowcount doesn't work. Since the results stay on the server side, there is no way to know how many rows there are until you have read them all.

Thursday, July 7, 2016

Bind Same Event with Different Identifier in JavaScript

  Sometimes you want to bind multiple functions to one event in JavaScript. Accomplishing that is simple: just bind your functions to the event repeatedly, like the following,
$("#hatch-chick").bind("click", function(){ "remark the hatch date on the sheet.";});
$("#hatch-chick").bind("click", function(){ "turn on the heat lamp.";});
And you can remove the bound handlers by calling .unbind(eventName), like the following,
$("#hatch-chick").unbind("click");


  However, sometimes you do not want to remove all of them at once; you may want to remove the "remark" one specifically. To achieve that, you need to namespace your event by adding a suffix to the original event name, for example click to click.remark. Adding the namespace does not affect the functionality: the handler is still triggered by the click event. But now you can identify the handler by the name click.remark, so you can unbind the "remark" handler only.
  The following is a simple example.
$("#hatch-chick").bind("click.remark", function(){ "remark the hatch date";});
$("#hatch-chick").bind("click.heat", function(){ "turn on the heat lamp";});

$("#hatch-chick").unbind("click.remark");

Sunday, July 3, 2016

Add a Local Dependency with Maven

  Sometimes you are building a private project and have developed a personal library for your very own usage. You do not want to publish your library to the public Maven repository, and you don't need to. This article shows you how to import a local Maven dependency.

  Before we start, assume that you already have a jar file, coop-0.0.1.jar. Its group id, artifact id and version are net.sunshire, coop and 0.0.1, respectively. Your current project is net.sunshire:farm:0.0.1, which lives at FARM_HOME=~/WorkSpace/farm.

  First, I recommend making a directory for your local library, LIB_SOURCE=$FARM_HOME/mvn-lib, and copying your dependency jar into it. Then make another directory for the local Maven repository, LOCAL_MVN=$FARM_HOME/local-maven-repo. Now you are prepared: by executing the following, you can deploy the jar to your local Maven repository.
mvn deploy:deploy-file \
  -Durl=file:///$LOCAL_MVN \
  -Dfile=$LIB_SOURCE/coop-0.0.1.jar \
  -Dpackaging=jar \
  -DgroupId=net.sunshire \
  -DartifactId=coop \
  -Dversion=0.0.1
To verify that the deployment succeeded, check the files under $LOCAL_MVN/. There should be directories and a jar named after your group id, artifact id and version (e.g. $LOCAL_MVN/net/sunshire/coop/0.0.1/coop-0.0.1.jar).

  After you have successfully deployed your local dependency, add the following to your current project's pom.xml.
<project ...>
  ...
  <dependencies>
    <dependency>
      <groupId>net.sunshire</groupId>
      <artifactId>coop</artifactId>
      <version>0.0.1</version>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>local-maven-repo</id>
      <url>file:${project.basedir}/local-maven-repo</url>
    </repository>
  </repositories>
</project>
When you are finished, you should be able to compile your project as usual, e.g. by executing mvn scala:compile test (the scala:compile goal is there because this example project happens to be a Scala one).

Another thing worth mentioning: when you deploy with mvn deploy:deploy-file ..., a duplicate also ends up in ~/.m2/. So I recommend changing the dependency's version number every time you change the dependency itself; otherwise Maven keeps using the cached copy in ~/.m2/ first.

Saturday, June 25, 2016

Including and Reading Resources with Maven/Java/Scala

  In the last few days, I wanted to make a Maven dependency read resources from its own jar, and I suddenly realized I had no prior experience with that. After some googling and experiments, I managed to read the resources, and here are the notes.

  Let's say you want to publish a jar library that ships with a few default configuration files; of course, you never want to hard-code the configuration in your code (don't do that if you have ever considered it). Then how do you keep the files in the jar? Experienced Java developers know that a jar is a compressed file, or more literally a zip file, so being able to keep a file inside it is reasonable, isn't it? If you are using Maven as your publishing tool, things should be easy.

  Set up your pom.xml and declare the resources like a boss, uh, I mean, like the following.
<project>
  ...
  <name>My Resources Plugin Practice Project</name>
  ...
  <build>
    ...
    <resources>
      <resource>
        <directory>src/resources</directory>
        <includes>
          <include>**/*</include>
        </includes>
      </resource>
      ...
    </resources>
    ...
  </build>
  ...
</project>
Of course you can point the directory wherever you want, but in practice I recommend putting the resources directory under src/, or under src/main if you want to separate the resources of the main and the test programs.

  Another tip: for files that belong to a specific package or class, you may want to put them into a directory that mirrors the package name. For example, if I have a picture chick.jpg for net.sunshire.farm.Chick, I would put it at src/main/resources/net/sunshire/farm/chick.jpg. By doing so, Maven packages chick.jpg under net/sunshire/farm, which keeps each jar independent (you won't mix up resources between different jars when you import many dependencies).

  Now you are done with including resources: if you package the jar with Maven, you can see the resources you just included inside the jar file. (To inspect a jar file, simply change the extension to .zip and decompress it.)

  And how do we read the included resource from Scala? It is actually pretty simple: Class has a method getResourceAsStream. Here is the usage in Scala:
package net.sunshire.farm;

import java.io.InputStream;

class Chick {
  val picPath: String = "/net/sunshire/farm/chick.jpg";
  // relative one: "chick.jpg", do NOT recommend
  val chickPic: InputStream = getClass.getResourceAsStream(picPath);
  // do whatever you want with the stream.
}
  The demonstration above is pretty straightforward; I guess the main problem is simply not knowing the method exists. The part worth mentioning is that getResourceAsStream takes either a relative or an absolute path; as in UNIX/POSIX systems, a leading / means the root. To clarify, the absolute path does not start from the root of the file system, but from the root of the jar. A relative path starts from the package declared at the top (/net/sunshire/farm/ in the example).
  Another thing: if you specify a relative path, it is resolved against the runtime class's package (if net.sunshire.house.Chicken extends Chick and Chick uses the relative path, the resolved path becomes /net/sunshire/house/chick.jpg), which is why I do not recommend the relative form.

  If you do not know Scala and need a Java example, the pom.xml part is the same; for the code part, please refer to the 2nd reference. That is all about including and reading resources; I hope it is helpful. :D

References:
[1] Resources Packaging - Maven Official
[2] Read a Resource in Java - Stack Overflow

Saturday, March 12, 2016

Interface to Google APIs with Python

    I demonstrated how to obtain a user's Google App credential in the last post, and of course you read that post because you need to build an application. This post talks about how to use Google-API-Python-Client and the concept behind its interface (under my own interpretation, of course). I will mainly use the Drive API in this demonstration.



    To call Google services you need to import the following packages:
import apiclient
from oauth2client.file import Storage
import httplib2


    Before calling a service API, you need to build the corresponding service object with the user credential obtained in the previous post, like the following:
SERVICE_NAME = 'drive'
SERVICE_VERSION = 'v3'
storage = Storage(CREDENTIAL_NAME)
credential = storage.get()
http = credential.authorize(httplib2.Http())
service = apiclient.discovery.build(SERVICE_NAME, SERVICE_VERSION, http=http)

    SERVICE_NAME is the service you are going to retrieve, and SERVICE_VERSION is, obviously, its version. The service names and versions can be found in the APIs Explorer: click the service you want to use and check the prefix before the first dot of each API name. APIs of the same service share the same prefix, and that prefix is your service name. The version depends on your needs and is usually in the form v[1-9].

    Now that you have the service instance, let's see which API you are going to call. In the previous section, you determined the service name from the first term of the API name. Now look at the remaining terms: they are the path for making the API call. For example, calling drive.files.list maps to the Python code service.files().list(...), and analytics.data.ga.get corresponds to service.data().ga().get(...) (the services are, of course, retrieved by different build() calls).

    You pass the parameters of an API as Pythonic named arguments. The following is a call of drive.files.list:
PAGE_SIZE = 10
QUERY = "name contains 'Hello'"
FIELDS = 'nextPageToken,files(id, name)'
results = service.files().list(
  pageSize=PAGE_SIZE,
  q=QUERY,
  fields=FIELDS
).execute()

    According to the documentation of drive.files.list, we can simply pass the parameters as named arguments.
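
    The call above fetches only a single page of results. To walk through all matching files, you can use the pageToken parameter together with the nextPageToken field that FIELDS already requests. The following is a minimal sketch of that paging loop, following the pattern from the drive.files.list reference:
page_token = None
while True:
  response = service.files().list(
    pageSize=PAGE_SIZE,
    q=QUERY,
    fields=FIELDS,
    pageToken=page_token
  ).execute()
  for f in response.get("files", []):
    print(f["id"], f["name"])
  page_token = response.get("nextPageToken")
  if page_token is None:
    break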

    However, it is not always as simple as the above. For an API like drive.files.create, you cannot specify the upload content with the parameters shown in the reference. In such a case, I recommend inspecting the API with the interactive help in IPython (or the plain Python interpreter), like the following:
help(service.files().create)
    With help(...), you can check all the information about the function, including its arguments and return value, in a very detailed way.

    After checking drive.files.create with help(...), we are able to upload (create) a file, like the following:
FILE_NAME = "test_text"
# FILE_PATH below is the local path of the file you want to upload
mediaBody = apiclient.http.MediaFileUpload(FILE_PATH, mimetype='text/plain')
body = {
  "name": FILE_NAME
}
service.files().create(media_body=mediaBody, body=body).execute()




    That is it. When I was working with the Google APIs, I encountered a lot of problems and did a lot of googling; all of the above are my notes about the package. I hope it is helpful for you.

Monday, February 29, 2016

Generate Google Application User Credentials with Python/Flask

    Recently, I was asked to build a Google application for our company. After some trials I understood how things work, and I want to document it, hoping it helps people with a similar need.

    First of all, you need to create a project for your application on the Google Developers Console. Then you need to enable the service you are going to use, by clicking the link 'Enable and manage APIs' or entering the 'API Manager.' Now you should see lists of Google APIs; choose the one you need and click the enable button.

    The next step is to generate application credentials. Go to 'API Manager > Credentials' and click Create Credentials > API key. You will choose between the types 'Server key', 'Browser key', 'iOS key' and 'Android key', depending on the platform your application runs on (here I chose the server key, because my application calls the APIs from a server instead of a browser or mobile device). The only information you need to provide for this key is the server name (providing your server's IP address is even better, of course). Now you can see your API key on the credentials page.

    After you have generated an API key, there is another credential to generate: the OAuth 2.0 client ID. Before you click the 'create credentials' button, switch to the 'OAuth consent screen' tab just above it and fill in the field 'Product name shown to users'. Now you can create the 'OAuth client ID' by clicking 'create credentials'. You need to choose your application type (I guess this doesn't really matter; I chose 'Other') and fill in your application name. After you have created the client ID, you can download your client key on the credentials page, to the right of your client ID, and your client secret can be seen by clicking your client ID's name.

** Remember, all of the API keys, client IDs and client secrets are supposed to be kept secret. DO NOT share them with anyone who is not on your project. **

    Now that you are done with all the paperwork, let's get our hands dirty. You need to obtain a user's credentials to perform actions on that user's behalf. I deploy a web application to acquire user credentials with Python/Flask. You need to build two pages: one for requesting authorization and one for receiving the result. The application may look like this:
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage

from flask import Flask, url_for, redirect, request
app = Flask(__name__)

CLIENT_ID_FILE = "your_client_id_file_name"
CLIENT_SECRET = "your_client_secret"
SCHEME = "http://"
DOMAIN = "localhost:5000"
AUTH_RETURN_PATH = "/auth_return"
REDIRECT_URI = SCHEME + DOMAIN + AUTH_RETURN_PATH
CREDENTIAL_NAME = "credential_name"
SCOPES = [
  "https://www.googleapis.com/auth/drive.file"
]

@app.route("/auth")
def auth():
  storage = Storage(CREDENTIAL_NAME)
  credentials = storage.get()
  if not credentials or credentials.invalid:
    print("!!! Cannot find this credential !!!")
    return request_credential(storage)
  else:
    return "You have authorized your credential."

def request_credential(storage):
  flow = OAuth2WebServerFlow(client_id=CLIENT_ID_FILE,
    client_secret=CLIENT_SECRET,
    scope=SCOPES,
    redirect_uri=REDIRECT_URI)
  auth_uri = flow.step1_get_authorize_url()
  return redirect(auth_uri)

@app.route("/auth_return")
def auth_return():
  flow = OAuth2WebServerFlow(client_id=CLIENT_ID_FILE,
    client_secret=CLIENT_SECRET,
    scope=SCOPES,
    redirect_uri=REDIRECT_URI)
  credentials = flow.step2_exchange(request.args.get("code", ""))
  storage = Storage(CREDENTIAL_NAME)
  storage.put(credentials)
  credentials = storage.get()
  if not credentials or credentials.invalid:
    return "Authorization failed."
  else:
    return "Authorization succeeded."


    In the code above, the entry point is 'http://localhost:5000/auth', which leads the user to auth(). It checks whether the credential already exists and calls request_credential if not. The authentication follows the OAuth 2.0 two-step flow, so it sends the user to Google to grant authorization and then redirects back to the return path given in redirect_uri.

    After you get the credential and call storage.put(credentials), the credential is stored at the location CREDENTIAL_NAME (which may be an absolute or a relative path).

    There is a constant named CLIENT_ID_FILE, which is the name of your client ID file in your file system, without the .json extension. And obviously, CLIENT_SECRET is the client secret mentioned earlier in this article.

    The SCOPES in the code can be found in each API's documentation. In this example, I requested the https://www.googleapis.com/auth/drive.file scope, which is listed in the Drive documentation.



    After reading this article, you should be able to obtain a Google service credential from a user. If any part is unclear, please ask without hesitation; I am glad to answer. I may write another post on how to operate some of the Google services I have worked with.

Saturday, February 20, 2016

Spark: Read/Write Sequence files

    Recently, I got some unusual requirements for Spark RDDs. I needed an RDD that supports multi-key value pairs, followed by IO operations. As usual, I started from Python, and I found that saveAsSequenceFile does not always work. After some searching, I assume it is because of the types written into the sequence file, which are not required when writing but are required when reading (sequenceFile). After all, if I did not specify the output types, how should I know what types to read?

    After that, I had to give up my obsession with driving Spark from Python and switched to Scala, which Spark fully supports. Nonetheless, the story did not end there as perfectly as the fairy tale of Prince Charming and Snow White. I tried to write the keys with classes implementing Writable, which triggers the implicit conversion of the RDD to SequenceFileRDDFunctions. Unfortunately, I noticed that both the key and the value also have to implement Serializable, which is reasonable: if the key and value are not serializable, how could they be passed between workers?

    In the end, to simplify everything, I decided to transform my keys into a single String by concatenation, since all of my keys were Strings in the first place anyway.

TL;DR - The key and value of an RDD have to implement Serializable. If you want to save the RDD as a sequence file, the key and value have to implement Writable.

    Well, I guess you are here for how to read/write a sequence file. If you meet the above requirements, you can just do:
import org.apache.hadoop.io.{IntWritable, Text}

val path: String = "your/path/for/rdd/here";
val rdd = sc.parallelize(
  List(("Raccoon", 1), ("Squirrel", 2), ("Ferret", 3))
);
rdd.saveAsSequenceFile(path);

val readRdd = sc.sequenceFile(path, classOf[Text], classOf[IntWritable]);
// here, the type you read back is Tuple2[Text, IntWritable];
// do not forget to transform it into Tuple2[String, Int].
val usableRdd = readRdd.map { case (key, value) =>
  (key.toString, value.get)
};

Monday, January 18, 2016

Setup WebHDFS on Existing Hadoop and Operate with Python

This article is about how to enable WebHDFS on Hadoop and then read/write files with Python. If you haven't set up a Hadoop environment yet, I recommend you follow this tutorial.

Recently, I needed to set up and test Hadoop at work, and I found the official Hadoop instructions not detailed enough. However, after some googling, I finally got the setup and tests done. Here are some of my experiences; I hope they are helpful. :)



1. Enable WebHDFS in the configuration file
"HADOOP_HOME/etc/hadoop/hdfs-site.xml"
by inserting:
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>
Then restart the Hadoop server. After that, WebHDFS should be ready to serve the HTTP API. In this article, I will not cover authenticated WebHDFS operations.

2. Test the service with curl.
i. Create
To create a file via WebHDFS, you can execute the command
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"
You will receive an HTTP 307 response whose Location header tells you where to actually upload the file.
Then execute the following:
curl -i -X PUT -T <LOCAL_FILE> <LOCATION_FROM_PREVIOUS>
The above only lets you upload a local file. If you want to write a string to the file instead, just replace "-T <LOCAL_FILE>" with "-d <DATA>".
ii. Read
Then you can read the file with
curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
The read API follows the same two-step strategy; the -L flag simply follows the returned location, so you get the data/file you are addressing.
Now, if you have set up WebHDFS successfully, you will see that it creates and returns your file correctly. If you want to learn more about operating WebHDFS with curl, please refer to the official instructions.
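
If you prefer to script this instead of using curl, the same two-step dance can be done with the Python requests package. This is just a sketch; the host, port, user and path below are placeholders taken from the configuration later in this post, and it assumes the cluster accepts simple (user.name) authentication.
import requests

HOST = "hdfs.hashfarm.cc"  # placeholders; point these at your own namenode
PORT = 50070
USER = "sunshire"
PATH = "/helloWord"

# Step 1: ask the namenode where to write; it answers with a 307 redirect.
create_url = "http://%s:%d/webhdfs/v1%s?op=CREATE&user.name=%s&overwrite=true" % (
  HOST, PORT, PATH, USER)
step1 = requests.put(create_url, allow_redirects=False)
location = step1.headers["Location"]

# Step 2: send the actual content to the datanode location we were given.
requests.put(location, data="Hello :)")

# Read it back; requests follows the 307 redirect for us.
open_url = "http://%s:%d/webhdfs/v1%s?op=OPEN&user.name=%s" % (HOST, PORT, PATH, USER)
print(requests.get(open_url).text)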

3. Operations with Python/hdfs [doc]
i. Install this package
pip install hdfs
ii. Write a configuration file
[global]
default.alias = dev

[dev.alias]
url = http://hdfs.hashfarm.cc:50070
user = sunshire
This config file tells mtth/hdfs to connect to http://hdfs.hashfarm.cc:50070 as sunshire when the alias is dev, and it also sets 'dev' as the default alias.
The file should be saved at ~/.hdfscli.cfg by default, or you can set the environment variable HDFSCLI_CONFIG to point at your own config file location.
iii. Write yourself a script
from hdfs import Config
import sys
fileName = "/helloWord"
message = "Hello :)"

client = Config().get_client()

with client.write(fileName, overwrite=True) as writer:
  writer.write(message)

ls = client.list("/")
if fileName.lstrip("/") not in ls:  # list() returns bare names without the leading "/"
  print("file not found")
  sys.exit()

readMessage = ""
with client.read(fileName) as reader:
  readMessage = reader.read()

print("wrote: " + message + ", read: " + readMessage)
In the above code, we create a client via the config. Here it uses the default alias, which is 'dev'. To use an explicit alias, just pass it as the first parameter of get_client, like the following:
client = Config().get_client("dev")
With mtth/hdfs, you can read/write HDFS like ordinary Python file operations, just using the with syntax.
By using client.list, you can get the listing (sub-directories and files) of a directory.
Notice: when addressing a file or a directory, remember to add "/" at the beginning; otherwise the package may not find the corresponding path, and it does not throw an error either.