readexcel - simple class for extracting data from an Excel File

readexcel is a somewhat streamlined rewrite of this small python wrapper class by Bryan Niederberger on top of John Machin's xlrd package, intending to make reading worksheets in an Excel file even simpler and more convenient.

The readexcel class is included in and is used by the small ExcelMailer command line business tool for emailing personalized messages with attachments. It reads its data and options directly out of an Excel file, on any platform. This also means that to read the Excel file it does not need Windows or MS Office.

The rewrite consists mostly of (a) removing attributes and methods on the wrapper class that seemed to add little or nothing on the equivalent that are already provided by xlrd and (b) various other code refactoring, e.g removing the per-worksheet cache for column names, factoring out the check for is_data_row(), adding comment rows, and in general using method names that seem to be clearer. The resulting code seems clearer and shorter -- even while adding a little in functionality it comes in at exactly 100 lines (while the original being 150+ lines long).

A big thank you to Bryan and John for providing these two tools: Bryan's original readexcel class saved me a lot of time and introduced me to the xlrd package; and without John's xlrd package ExcelMailer would not have been possible in the first place.

The extracted source code for the readexcel class is below:

class readexcel(object): """ Simple OS-independent class for extracting data from an Excel File. Uses the xlrd module (version 0.5.2 or later), supporting Excel versions: 2004, 2002, XP, 2000, 97, 95, 5, 4, 3 Data is extracted via iterators that return one row at a time -- either as a dict or as a list. The dict generator assumes that the worksheet is in tabular format with the first "data" row containing the variable names and all subsequent rows containing values. Extracted data is represented fairly logically. By default dates are returned as strings in "yyyy/mm/dd" format or "yyyy/mm/dd hh:mm:ss", as appropriate. However, when specifying date_as_tuple=True, dates will be returned as a (Year, Month, Day, Hour, Min, Second) tuple, for usage with mxDateTime or DateTime. Numbers are returned as either INT or FLOAT, whichever is needed to support the data. Text, booleans, and error codes are also returned as appropriate representations. Quick Example: xls = readexcel('testdata.xls') for sname in xls.book.sheet_names(): for row in xls.iter_dict(sname): print row """ def __init__(self, filename): """ Wraps an XLRD book """ if not os.path.isfile(filename): raise ValueError, "%s is not a valid filename" % filename self.book = xlrd.open_workbook(filename) self.sheet_keys = {} def is_data_row(self, sheet, i): values = sheet.row_values(i) if isinstance(values[0], basestring) and values[0].startswith('#'): return False # ignorable comment row for v in values: if bool(v): return True #+ row full of (valid) False values? return False def _parse_row(self, sheet, row_index, date_as_tuple): """ Sanitize incoming excel data """ # Data Type Codes: # EMPTY 0 # TEXT 1 a Unicode string # NUMBER 2 float # DATE 3 float # BOOLEAN 4 int; 1 means TRUE, 0 means FALSE # ERROR 5 values = [] for type, value in zip( sheet.row_types(row_index), sheet.row_values(row_index)): if type == 2: if value == int(value): value = int(value) elif type == 3: datetuple = xlrd.xldate_as_tuple(value, self.book.datemode) if date_as_tuple: value = datetuple else: # time only no date component if datetuple[0] == 0 and datetuple[1] == 0 and \ datetuple[2] == 0: value = "%02d:%02d:%02d" % datetuple[3:] # date only, no time elif datetuple[3] == 0 and datetuple[4] == 0 and \ datetuple[5] == 0: value = "%04d/%02d/%02d" % datetuple[:3] else: # full date value = "%04d/%02d/%02d %02d:%02d:%02d" % datetuple elif type == 5: value = xlrd.error_text_from_code[value] values.append(value) return values def iter_dict(self, sname, date_as_tuple=False): """ Iterator for the worksheet's rows as dicts """ sheet = self.book.sheet_by_name(sname) # XLRDError # parse first row, set dict keys & first_row_index keys = [] first_row_index = None for i in range(sheet.nrows): if self.is_data_row(sheet, i): headings = self._parse_row(sheet, i, False) for j, var in enumerate(headings): # replace duplicate headings with "F#". if not var or var in keys: var = u'F%s' % (j) keys.append(var.strip()) first_row_index = i + 1 break self.sheet_keys[sname] = keys # generate a dict per data row if first_row_index is not None: for i in range(first_row_index, sheet.nrows): if self.is_data_row(sheet, i): yield dict(map(None, keys, self._parse_row(sheet, i, date_as_tuple))) def iter_list(self, sname, date_as_tuple=False): """ Iterator for the worksheet's rows as lists """ sheet = self.book.sheet_by_name(sname) # XLRDError for i in range(sheet.nrows): if self.is_data_row(sheet, i): yield self._parse_row(sheet, i, date_as_tuple)

Your comments are welcome.