a journal of a researcher: July 2004

Proposal

It is a nightmare to write a research project proposal. Besides the part to talk about the research, you have a lot of overhead to fill the forms and get the signatures. This time, I wait to the last minute.

The Web Services is the topic of our proposal. I found there are a lot to do. From this proposal, I find myself in e-business domain.

Use WSDL for Web Services

You do not need WSDL for web services, if you know how to invoke the services. WSDL is useful in two ways: when you want to invoke a service you do not know, you need information like the methods and the methods’ signatures; secondly, WSDL plays a role exactly like IDL to bridge the different languages, i.e. different languages maps to WSDL and use SOAP for protocol. It is yet another tempt for people to provide a uniform interface over specific languages. This time, it is over HTTP and SOAP. It is easier to code than CORBA. But the complexity of using CORBA and using WSDL are comparable.

Because AXIS uses bean serializer to implement the types in WSDL, the SOAP message uses tages for the attributes and values of the objects (cf. my post on 7/22).

People can image, with additional layers of HTTP and SOAP, the web services are slower than RPC or RMI. Here is a link provided by Michael Sintek on the performance of different SOAP implementations. Hopefully the fastest JibxSoap is just a little bit slower than RMI. Well, it is difficult to say if the slow is significent or not. But it is rather faster than other SOAP implmentations. I cite: “The performance of the JibxSoap implementation is likely due to two factors. First off, the light-weight nature of the framework and efficient internal operation adds very little per-message overhead. This per-message overhead looks to be especially high with Axis (as shown by the poor performance of all the Axis versions at very low response density, where the number of messages exchanged is considerably larger than all the other test cases combined). Secondly, the JiBX data binding basis of JibxSoap allows very fast conversion between XML and Java objects, minimizing the overhead for larger messages.” It is also interesting to know that Sun wants to use binary data encoding to replace XML, in order to improve the web service performance.

Isn’t WSDL another tempt to provide a framework over specific languages? Do we always turn in circles?

Put RDF into SOAP messages

The following problem is discussed with Michael Sintek today. Here is my understanding and some experiements on it.

If you need RDF embedded in SOAP, I have some ideas for AXIS:
1)      Use “message” type services. You can use objects in Element type as the parameters to invoke the services and get Element type back. Then you can create CDATA as children for Element object. In CDATA, you can define any format of xml. (ref. the “message services” in AXIS manual)

2)      Pass as JavaBean. You can define your own beans. The beans are serialized by BeanSerializeFactory. The SOAP message has tags for the attributes and values of the object. E.g. the sample/userguide/sample5 has a SOAP like following to pass Order. (XML has problem to be included in blog)

3)      Define your own serializer. It is quite similar to what Bean serializer can do, but in a more controllable way. See the samples/encoding. You can define your own data type

4)      As attachment. AXIS uses MIME Stream for attachments (e.g. activation.jar and mail.jar). The SOAP message has the stream id. Attachments are not inside SOAP message.

I think only 1) and 4) are the choices.

The log4j system

If you do not set log4j system well, you will get warn message when you run axis applications.

You need a log4j configuration file. The configuration file for logging the SOAP relevant information in file “axis.log” is

# Set root logger level to WARN and its only appender to A1.
log4j.rootLogger=WARN, A1
# show DEBUG for SOAPPart method
log4j.logger.org.apache.axis.SOAPPart=DEBUG

# A1 is set to be a ConsoleAppender.
# log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1=org.apache.log4j.FileAppender
log4j.appender.A1.File=axis.log

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

Then you save it in log4j.properties in the directory where the java command runs. Java will find the configuration file and use it automatically. E.g.

Java samples.stock.GetQuote -uuser1 -wpass1 XXX

Or use the option –D to indicate the configuration file:

java -Dlog4j.configuration=file:log4j.properties samples.stock.GetQuote -uuser1 -wpass1 XXX

To open the server side log file for AXIS, put the log4j.properties in

%AXIS_START_DIR%/axis/WEB-INF/classes

You will find the log file at the directory where you start the tomcat server. On my computer is at

%CATALINA_HOME%/bin

(You need option –cp if you did not set the classpath.)

Thanks Michael Sintek to help me understand Log4j and provide the settings for SOAP application.

Blog’s Usage

I got comments from my colleagues on my posts in blog. That is wonderful, because I have so many people to check my work, verify it from various points of view. I found these comments help me deeper my thinking. It also forces me to write down everyday’s progress. That is a great experience. Now I found blog is an interesting tool.

If I do not mind to expose my weak thinking, you can comment as you want. More or less, we each is partially right.

Keogh’s Theorems on Estimating Error

Keogh claims in his recent paper that he gets O(n) algorithm for linear approximation. The reason that the linear approximation algorithms, like Sliding Window and Bottom Up, are O(Ln) is because we need O(L) for calculating the error for each segment, where L= n/k is the average length of k segments. Keogh claims that you need constant time to calculate the error of the merged segment from the two original segments. I tried to calculate the error today. It becomes complex after a while. I did not get the closed form formula. But I think he is right. For calculating the error, there is only SSxx= Sum(Xi^2) – n*Avg(X)^2; SSyy= Sum(Yi^2) – nAvg(Y)^2; SSxy = Sum(Xi*Yi)-n*Avg(X)*Avg(Y), in the formula.

Think about Avg(X)=Sum(Xi)*n. So you have only terms like Sum(Xi), Sum(Xi^2) in the formula. If you have a merged sequence, these terms can be calculated from the terms from the two original segments.

Curvature and the order of polynomial

I was wondering what the relation of curvature and the order of polynomial you select for regression. If the curvature is 0, it is better to use a straight line. If it is high, do we need to use higher order polynomial? I find the later statement does not hold. Very simple, the curvature depends on the point. Not every point on a higher order polynomial has higher curvature. Actually it is hard to tell where the curvature is high, it depends on the shape of polynomial. So there is no relation.

The regression method considers solely the error. Even if you do not have the best fitting curve, you still can get good fitting quality (if you are allowed to partition the domain).

What kind function to choose for approximating indeed depends on the data. If we can visualize the data, we can see that if it is a curve so that we can use polynomial or otherwise straight line. But if we can not, the only way is to try it out. I found a paper to decide the polynomial and the partition.

Why not polynomial regression?

Today I tried polynomial regression on the monotone problem. It seems that the polynomial regression is not better as linear regression for detecting the monotone as following:
1) more complex to compute. I did not check the complexity yet.
2) harder to decide the segment border. The polynomial frequently has a small fluctuation at the endpoints of the segments. It may move if you let the segment overlap though. But maybe it is not worth to investigate the optimal ways.
3) it is so fit to the data that it does not remove the “waves” caused by noise.
4) difficult to choose the order for the polynomial. You may need to choose different order at different segments.

Be a Seller on Amazon

I sold a used book on Amazon. Long time buyer can be once a seller. Amazon is a good platform for buy and sell. For selling a book, my experience is not bad, but there is more to improve. For example, I could not find the webpage to input my checking account information. When I found it, it is out of service for a while. The shipping charge is estimated less than what I had to pay at the counter. Amazon must use its discounted price which is unable to get as normal customer. The buyer sent me three Emails to query the shipment of the book in one week. Hopefully in the last Email, she said the book was arrived.
Amazon earned $6 commission for this transaction.

You can also sell your CDs and software on Amazon. I like Amazon’s way. You post it on a fixed price, and forget it. You get informed when anyone wants it. The payment is mediated by Amazon. You do not go to a third company. I listed the book on Amazon less than two weeks to sell it.

Auction? I created an account on e-bay to sell my cell phone. I am getting rid of my cell phone. But e-bay’s interface is too complex to use. I don’t know if the auction can reach my price and I have to watch it for a while. Finally I did not make a move.

The Complexity of Machine Learning Algorithms

Machine learning algorithms are complicated due to different implementations in different contexts. I just did a quick literature search and consulted my colleagues. Here are some results:
• Decision tree induction: O(mnlogn). n is the number of instances, m is the number of the features.—Witten and Frank, Data Mining.
• K nearest neighbour: O(n) (maybe O(mn) for m features)
• Tune the parameters for Neutral Network: NP hard.

Well, I have no energy to investigate more. People should have a clear answer to it. What I can know is that linear or quadratic algorithms for statistic learning are not rare.

Daniel gives a nice estimation for the complexity: if it takes a minute to process 1 MB, and you have 1 GB of data, it will take more than a year if it is O(n^2). Probably it is too prefect to find linear algorithm, then Daniel claims that O(nlogn) is a target.

Thanks for Daniel Lemire and Peter Turney for the inputs.

A Day

Many emails are about the flat segments. Each one takes me five minutes to think. I think Martin and Daniel are better persons to work on it. Hopefully I convinced them the necessity of flat segments.

I want to move on to the multi dimension case. I will use the partial definition, just one step to go beyond the one dimension. Daniel probably is going to ask me why I want this not elegant method, instead of topology. First, I don’t know topology. Second, I find it is too complex to use in multi dimension case. Well, it is nice to play the theorems. But now I want to reach what Suc can do by his learning algorithm.

I went over Suc’s paper again, and computed the complexity of his algorithm. Surprisingly it is O(n^2logn). I thought it should be higher order. Daniel told me O(n^2) is not good enough. It is if there is O(n). But it is not bad for most of the problems. I am sending several emails to ask people about complexity of learning algorithms. I know there is answer, but I have no time and energy to find it out myself. Who can tell me?

I booked my trip to Spain and I discussed vacation. I even read the rouge guide of Spain during my lunch time. Valencia is the third largest city in Spain and the famous tomato throwing fest is in a near by town called Bunol. And most important is that the tomato throwing only happens once per year on the last Wednesday of August! Exactly the date of ECAI! Am I going to see the tomato throwing? I found I have several old T-shirts in my closet.

Time never is enough. I promised to work on web service this month. I have to do it. But research needs me to focus on a problem until finding the answer.

Playing with the flat segments

I am keen to have a flat segment. Daniel, a pessimist, comes and says, if you want it, you will break the prefect of the framework. Without appreciating the beauty of the theorem, I say, I want it. I invent odd algorithms to break the prefect framework, but I do not have the ability to make up the theorems. I have many inspirations; I have many impulses; but I have no theorems. Whatever, playing in the garden of science already makes me happy.

a journal of a researcher

Monday, July 26, 2004

Proposal

Sunday, July 25, 2004

Use WSDL for Web Services

Thursday, July 22, 2004

Put RDF into SOAP messages

The log4j system

Wednesday, July 14, 2004

Blog’s Usage

Keogh’s Theorems on Estimating Error

Curvature and the order of polynomial

Tuesday, July 13, 2004

Why not polynomial regression?

Wednesday, July 07, 2004

Be a Seller on Amazon

The Complexity of Machine Learning Algorithms

Tuesday, July 06, 2004

A Day

Monday, July 05, 2004

Playing with the flat segments

About

Previous

Archives

NRC Bloggers

My Links

Feeds