Monday, January 18, 2016

Set Up WebHDFS on an Existing Hadoop Cluster and Operate It with Python

This article covers how to enable WebHDFS on Hadoop and then read and write files with Python. If you haven't set up a Hadoop environment yet, I recommend following this tutorial.

Recently, I needed to set up and test Hadoop at work, and I found the official Hadoop instructions not detailed enough. After some googling, I finally got the setup and tests done. Here are some of my experiences; I hope they are helpful. :)



1. Enable WebHDFS in the configuration file
"HADOOP_HOME/etc/hadoop/hdfs-site.xml"
by inserting:
<property>
   <name>dfs.webhdfs.enabled</name>
   <value>true</value>
</property>
Then restart the HDFS daemons (NameNode and DataNodes). After that, WebHDFS is ready to serve Hadoop's REST API. In this article, I will not cover authenticated WebHDFS operations.
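To quickly check that WebHDFS is answering, you can also hit the REST endpoint from Python. Below is a minimal sketch using the requests library; the host and port are placeholders (I assume the NameNode web interface on the default port 50070), so replace them with your own.

import requests

# Assumed NameNode address; replace with your own host and port.
NAMENODE = "http://localhost:50070"

# LISTSTATUS on the HDFS root returns a JSON directory listing
# when WebHDFS is enabled and the NameNode is up.
resp = requests.get(NAMENODE + "/webhdfs/v1/?op=LISTSTATUS")
resp.raise_for_status()
print(resp.json()["FileStatuses"]["FileStatus"])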

2. Test the service with curl.
i. Create
To create a file via WebHDFS, you can execute the command
curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"
You will receive an HTTP 307 response whose Location header tells you where to upload the file (a DataNode address).
Then execute the following:
curl -i -X PUT -T <LOCAL_FILE> <LOCATION_FROM_PREVIOUS>
The command above uploads a local file. If you want to send a string as the file content instead, just replace "-T <LOCAL_FILE>" with "-d <DATA>".
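The same two-step flow can be scripted in Python with the requests library. This is only a minimal sketch: the host, path, and user below are placeholders, and I pass user.name so the file gets a sensible owner on an unauthenticated cluster.

import requests

NAMENODE = "http://localhost:50070"   # assumed NameNode address
PATH = "/helloCurl"                   # hypothetical target path

# Step 1: ask the NameNode where to write. Do not follow the redirect
# automatically so we can grab the DataNode URL from the Location header.
resp = requests.put(
    NAMENODE + "/webhdfs/v1" + PATH,
    params={"op": "CREATE", "overwrite": "true", "user.name": "sunshire"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# Step 2: send the actual data to the DataNode location.
resp = requests.put(datanode_url, data=b"Hello from requests :)")
print(resp.status_code)  # 201 Created on success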
ii. Read
Then you can read the file with
curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN"
The read API also applies the two-step strategy: the NameNode redirects you to a DataNode location, and by following the returned Location (the -L flag does this for you) you get the data of the file you requested.
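Reading can be scripted the same way; for a GET, requests follows the redirect automatically, so the two steps collapse into one call. A short sketch under the same placeholder host and path as above:

import requests

NAMENODE = "http://localhost:50070"   # assumed NameNode address
PATH = "/helloCurl"                   # hypothetical target path

# OPEN redirects to a DataNode; requests follows it by default for GET.
resp = requests.get(NAMENODE + "/webhdfs/v1" + PATH, params={"op": "OPEN"})
resp.raise_for_status()
print(resp.content)  # raw bytes of the file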
Now, if you have set up WebHDFS successfully, you will see that it creates and returns your file correctly. If you want to learn more about operating WebHDFS with curl, please refer to the official documentation.

3. Operations with Python/hdfs [doc]
i. Install this package
pip install hdfs
ii. Write a configuration file
[global]
default.alias = dev

[dev.alias]
url = http://hdfs.hashfarm.cc:50070
user = sunshire
This config file tells mtth/hdfs to connect to http://hdfs.hashfarm.cc:50070 as user sunshire when the alias is dev, and also sets 'dev' as the default alias.
The file should be saved at ~/.hdfscli.cfg by default, or you can set the environment variable HDFSCLI_CONFIG to point to your own config file location.
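If you prefer not to use a config file at all, the package also lets you build a client directly. A minimal sketch; the URL and user below are just the values from the config above:

from hdfs import InsecureClient

# Equivalent to the 'dev' alias above, but without a config file.
client = InsecureClient("http://hdfs.hashfarm.cc:50070", user="sunshire")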
iii. Write yourself a script
from hdfs import Config
import sys

fileName = "/helloWorld"
message = "Hello :)"

# Build a client from ~/.hdfscli.cfg using the default alias ('dev').
client = Config().get_client()

# encoding='utf-8' lets us write a str directly instead of bytes.
with client.write(fileName, overwrite=True, encoding="utf-8") as writer:
  writer.write(message)

# client.list returns bare entry names (no leading "/"), so strip it
# before checking whether the file shows up under the root directory.
ls = client.list("/")
if fileName.lstrip("/") not in ls:
  print("file not found")
  sys.exit()

# encoding='utf-8' makes reader.read() return a str instead of bytes.
readMessage = ""
with client.read(fileName, encoding="utf-8") as reader:
  readMessage = reader.read()

print("wrote: " + message + ", read: " + readMessage)
In the above code, we create a client from the config. Here it uses the default alias, which is 'dev'. To use an explicit alias, just pass it as the first parameter of get_client, like the following:
client = Config().get_client("dev")
With mtth/hdfs, you can read and write HDFS files much like ordinary Python file objects, using the with syntax.
By using client.list, you can get the list of entries (files and subdirectories) in a directory.
Note that when referring to a file or directory, remember to add "/" at the beginning. Otherwise the path is resolved relative to the user's home directory rather than the HDFS root, so the package may not find the path you expect, and it will not throw an error either.
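A quick way to see what is actually there is to inspect paths explicitly. A small sketch, reusing the client from the script above; note that client.list returns bare names without a leading "/":

# Entries come back as bare names, e.g. ['helloWorld', 'user', ...].
print(client.list("/"))

# status() with strict=False returns None instead of raising an error
# when the path does not exist, which makes existence checks easy.
if client.status("/helloWorld", strict=False) is None:
  print("/helloWorld does not exist")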