OK, another night, another trial. I must say, tonight was a lot more fun than the last couple of nights, because I really felt that I learned something, which is the whole point of this exercise. The example I was trying to code tonight is a simple EOF (empirical orthogonal function) analysis of a 3D data series. This is something I had to code up at work today, so it was a perfect chance for me to try out Sage. For work I ended up altering an existing m-file and running the EOFs in Matlab, but that’s OK, because now I know what I expect to see after running this in Sage.
The data names have been changed to protect the innocent.
# Load in required modules
sage: from scipy.io.netcdf import *
sage: from pylab import *
sage: from scipy.stats.stats import nanmean
sage: import datetime
#Load data from NetCDF file
sage: ncfile = netcdf_file('file.nc','r')
sage: varnames = ncfile.variables.keys()
sage: varnames
['LONGITUDE', 'TIME', 'LATITUDE', 'DATA']
#Now that I have the order I can load into arrays
sage: lon = ncfile.variables[varnames[0]][:]
sage: lat = ncfile.variables[varnames[2]][:]
sage: dates = ncfile.variables[varnames[1]][:]
sage: raw = ncfile.variables[varnames[3]][:,0:50,:] #I only want 50 records in Y
sage: data = raw.copy() #make a copy
sage: data.shape
(124, 50, 151)
sage: (ncycles, ny, nx) = data.shape
#deal with dates
sage: ncfile.variables[varnames[1]].attributes
{'axis': 'TIME',
'time_origin': '15-JAN-1901 00:00:00',
'units': 'HOURS since 1901-01-15 00:00:00'}
sage: off = datetime.datetime(1901,1,15,0,0,0)
sage: months = ones(ncycles)
sage: for i in range(0,ncycles):
....tdel = datetime.timedelta(days=dates[i]/24)
....td = off + tdel
....months[i] = td.month
sage: ind = where(raw<0)
sage: data[ind] = nan
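A quick aside for anyone following along: where() returns a tuple of index arrays, one per dimension, and that tuple is what the fancy-indexing assignment above uses. A toy example (plain NumPy behavior, separate from the real data):
sage: where(array([[1,-2],[-3,4]]) < 0) #row indices first, then column indices
(array([0, 1]), array([1, 0]))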
And here was the first real bottleneck: things slowed to a crawl while Python found all the instances where the data was less than zero. This is instantaneous in Matlab, but here it took over 30 seconds to go through 124*50*151 values. There must be a faster way to do this.
sage: data2 = data.copy()
#Take out monthly averages
sage: mclim = ones((50,151))
sage: for i in range(1,13):
....index = where(months==i)[0]
....mclim = nanmean(data[index,:,:])
....data2[index,:,:] = data[index,:,:] - mclim
sage: data2.shape = (ncycles, nx*ny) #flatten each map so every time step is one row
sage: ltmean = nanmean(data2) #get mean of each time series
#take out long term mean
sage: anom = data2.copy()
sage: for i in range(0,ncycles):
....anom[i,:] = data2[i,:] - ltmean
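In hindsight, NumPy broadcasting should do the same row-by-row subtraction without the explicit loop (a sketch, assuming ltmean came out with shape (nx*ny,)):
sage: anom = data2 - ltmean #broadcasting subtracts the long-term mean from every time step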
sage: EOF = nan_to_num(anom) #push land back to zero
sage: [u,s,v] = linalg.svd(EOF)
sage: s2 = zeros((ncycles,ncycles)) #diagonal matrix to hold the singular values
sage: for i in range(0,ncycles):#build array so that we can project eigenvalues back onto timeseries
....s2[i,i] = s[i]
sage: amp = dot(s2.transpose(),u.transpose()) #get amplitude
sage: spatial = v[0:4,:]# pull out spatial fields
sage: ratios = pow(s,2)/sum(pow(s,2))*100 #get %variance explained for each mode
sage: temp = spatial[0,:]
sage: temp.shape = (ny,nx) #push back to original dims
sage: plot(amp)
sage: savefig('amplitude.png')
sage: imshow(flipud(temp))
sage: savefig('spatial.png')
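An optional sanity check that could be bolted on here (a sketch, assuming s2 is the square diagonal matrix built above, so that amp is the singular values times the transposed left vectors):
sage: recon = dot(amp.transpose(), v[0:ncycles,:]) #rebuild the anomaly matrix from amplitudes and patterns
sage: abs(EOF - recon).max() #should be ~0, up to floating point noise
sage: recon4 = dot(amp.transpose()[:,0:4], spatial) #the same field using only the first 4 modes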
Success!
I actually felt really positive about this whole example, as I learned a lot more. This was probably too large an array to test with (measure twice, cut once!), but it’s what I was working with and I wanted a real-world example. The more I worked in Sage, the more comfortable I felt. The geographic projection issue is still there, as are some indexing speed issues, but overall I was really impressed with the Sage/SciPy/NumPy experience today. I feel that more of a transition was made for me last night and today, which was great timing, as a co-worker actually called and asked if I knew of any free replacements for Matlab…
July 25, 2008 at 5:33 am
Good to see you’re warming up to Python et al!
It’s strange that it took so long to index into your matrix . . . that size shouldn’t be too large at all.
In IPython on my 5-yr-old machine it’s near-instantaneous:
import numpy
z = numpy.random.standard_normal( (124,50,151) )
timeit z<0
# 100 loops, best of 3: 11.4 ms per loop
I wonder if the bottleneck is in loading the NetCDF data . . . maybe it’s doing some funky “lazy loading” and not actually retrieving the data until you do the data<0 test?
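If that’s what’s happening, one way to check would be to force the slice into a plain in-memory array right after the read (a rough sketch, untested against your file):
import numpy
raw = numpy.asarray( ncfile.variables['DATA'][:,0:50,:] )  # force the data off disk into a real ndarray
timeit raw < 0  # if this is now fast, the slowdown was in the read, not the comparison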
Sidenote: it was easier for me to make the transition from Matlab syntax when I found out vectorizing tricks like
data[data<0] = NaN
work with NumPy arrays.
So far it looks like everything you’re doing can be done without needing Sage. I haven’t installed it to test it, but poking around it seems the advantages of Sage are
1) its integration with computer algebra systems (http://www.sagemath.org/doc/html/tut/node46.html)
2) Notebooks
If you can get away with not using either of these, then it might be worth a shot to install the Python packages you’d use individually (NumPy, SciPy, matplotlib, basemap). Then you can always stay on top of the latest versions and bugfixes, and take advantage of high-quality documentation for the separate packages, for example matplotlib’s new docs at http://matplotlib.sourceforge.net/doc/html/users/index.html
-ryan
July 25, 2008 at 8:57 am
Hey Ryan,
again, thanks for stopping by and for your insights. I tried your random number example in Sage and in straight Numpy and got some interesting results. First off, timeit would not work in the Sage environment, but a simple
z = standard_normal( (124,50,151) )
ind = numpy.where(z<0)
takes 20 seconds in Sage 3.0.5 (running Numpy 1.0.4),
while in standard Python (running Numpy 1.1.0) that command is instantaneous:
In [31]: timeit z<0
100 loops, best of 3: 3.93 ms per loop
You are also dead on with regard to not using the major features of Sage that would differentiate it from just using Python+packages. I feel the same way, but I wanted to see if Sage would offer “one-stop shopping” where I could load data, run an EOF (Matlab/Python), run a Generalized Additive Model (R), and plot a nice map (GMT) without having to push the data to three places.
At this point, though, the older version of Numpy really seems to be a problem for the applications I want to run: I cannot use Basemap in matplotlib, and there are the speed issues on indexing arrays.
So I do have Numpy/SciPy/IPython/Basemap set up (no RPy yet), and I think that I will just be content to use these and report that Sage just isn’t for me right now due to the Basemap and speed issues.
I am excited about the possibilities here, and I think that IPython is an amazing IDE to work in (especially ipython -pylab!). I actually *cheated* and used Matlab the other day, and it felt like a step backwards, especially when the Java-based IDE kept freezing on me (coupled with the fact that I *still* can’t run 2008a or 2008b due to my eth0 being disabled).
Again, thanks for the comments and I’ll try to get some more playtime in with python today.
August 10, 2008 at 12:27 pm
Hello,
We’re planning on upgrading scipy, numpy, and matplotlib to the latest releases in Sage 3.1, which should be out next week.
Regarding the speed of numpy.where(z<0) in Sage: this happens because numpy and Sage data types don’t always play nicely together yet.
Sage has a light preparser that does things like replacing what would be Python integers with Sage’s Integer class (which is based on GMP). For example,
sage: preparse("numpy.where(z<0)")
'numpy.where(z<Integer(0))'
I’m not sure where all the time is being spent in the numpy routine. But, there are a couple possible workarounds:
1) Turn off the Sage preparser with
sage: preparser(False)
2) Use the 'r' suffix to tell the preparser not to change the Python integer to a Sage one:
sage: preparse("numpy.where(z<0r)")
'numpy.where(z<0)'
3) Replace Integer with int:
sage: Integer = int
The same sort of thing applies with Sage’s RealNumber class and Python’s float.
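For instance (from memory, so the exact printed form may differ slightly):
sage: preparse("y = 0.5")
"y = RealNumber('0.5')"
sage: preparse("y = 0.5r")
'y = 0.5'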
–Mike
August 12, 2008 at 1:00 pm
Hi Mike,
Thanks! I appreciate the comment and the notes. I’ll be sure to upgrade Sage when ready and try out your suggestions.