This notebook was put together by Jake Vanderplas for UW's Astro 599 course. Source and license info is on GitHub.

In [2]:

%run talktools.py

Advanced String Manipulation¶

& File I/O¶

One of the areas where Python has a distinct (and huge) advantage over lower-level languages like C is in its string manipulation. Operations that are downright painful in other languages can be accomplished very straightforwardly in Python.

The `string` module¶

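We can get a preview of what's available by examining the built-in string module. Note that most of the operations shown below are methods of string objects themselves rather than functions in this module: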

In [3]:

import string
dir(string)

Out[3]:

['ChainMap',
 'Formatter',
 'Template',
 '_TemplateMetaclass',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__initializing__',
 '__loader__',
 '__name__',
 '__package__',
 '_re',
 '_string',
 'ascii_letters',
 'ascii_lowercase',
 'ascii_uppercase',
 'capwords',
 'digits',
 'hexdigits',
 'octdigits',
 'printable',
 'punctuation',
 'whitespace']
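
As a small illustration of the listing above, the sketch below uses a few of the module's constants and its `capwords` function. The sample strings and variable names are made up for the example.

```python
import string

# The constants are ordinary strings, so they can be printed or searched.
print(string.ascii_lowercase)   # abcdefghijklmnopqrstuvwxyz
print(string.digits)            # 0123456789

# Drop every punctuation character from a piece of text.
word = "Hello, world!"
stripped = "".join(ch for ch in word if ch not in string.punctuation)
print(stripped)                 # Hello world

# capwords() splits on whitespace, capitalizes each word, and rejoins.
capped = string.capwords("hello there my friend")
print(capped)                   # Hello There My Friend
```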

Modifying Case:¶

`lower()`, `upper()`, `title()`, `capitalize()`, `swapcase()`¶

In [4]:

s = "HeLLo tHEre MY FriEND"

In [5]:

s.upper()

Out[5]:

'HELLO THERE MY FRIEND'

In [6]:

s.lower()

Out[6]:

'hello there my friend'

In [7]:

s.title()

Out[7]:

'Hello There My Friend'

In [8]:

s.capitalize()

Out[8]:

'Hello there my friend'

In [9]:

s.swapcase()

Out[9]:

'hEllO TheRE my fRIend'
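
One point the cells above rely on without stating it: Python strings are immutable, so each of these methods returns a new string and leaves the original untouched. A minimal sketch (the name `shouted` is only for illustration):

```python
s = "HeLLo tHEre MY FriEND"

# upper() returns a new string; s itself is unchanged.
shouted = s.upper()
print(s)        # HeLLo tHEre MY FriEND
print(shouted)  # HELLO THERE MY FRIEND

# To keep the result under the same name, reassign it.
s = s.lower()
print(s)        # hello there my friend
```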

Splitting, Cleaning, and Joining¶

`split()`, `strip()`, `join()`, `replace()`¶

In [10]:

s.split()

Out[10]:

['HeLLo', 'tHEre', 'MY', 'FriEND']

In [12]:

L = s.capitalize().split()
L

Out[12]:

['Hello', 'there', 'my', 'friend']

In [13]:

s = '_'.join(L)
s

Out[13]:

'Hello_there_my_friend'

In [14]:

s.split('_')

Out[14]:

['Hello', 'there', 'my', 'friend']

In [15]:

''.join(s.split('_'))

Out[15]:

'Hellotheremyfriend'

In [16]:

s = "    Too many spaces!    "
s.strip()

Out[16]:

'Too many spaces!'

In [17]:

s = "*~*~*~*Super!!**~*~**~*~**~"
s.strip('*~')

Out[17]:

'Super!!'

In [18]:

s.rstrip('*~')

Out[18]:

'*~*~*~*Super!!'

In [19]:

s.lstrip('*~')

Out[19]:

'Super!!**~*~**~*~**~'

In [20]:

s.replace('*', '')

Out[20]:

'~~~Super!!~~~~~'

In [21]:

s.replace('*', '').replace('~', '')

Out[21]:

'Super!!'
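
Two details of these methods are easy to trip over, so here is a short sketch with made-up sample strings. First, `split()` with no argument collapses runs of whitespace, while `split(' ')` does not. Second, the argument to `strip()` is a set of characters to remove from the ends, not a substring.

```python
spaced = "a  b   c"

# No argument: split on any run of whitespace.
no_arg = spaced.split()
print(no_arg)        # ['a', 'b', 'c']

# Explicit separator: every single space is a split point,
# so consecutive spaces produce empty strings.
one_space = spaced.split(' ')
print(one_space)     # ['a', '', 'b', '', '', 'c']

# strip('xy') removes any mix of 'x' and 'y' from both ends.
padded = "xyxhelloyx"
trimmed = padded.strip("xy")
print(trimmed)       # hello
```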

Finding substrings¶

`find()`, `startswith()`, `endswith()`¶

In [22]:

s = "The quick brown fox jumped"
s.find("fox")

Out[22]:

16

In [23]:

s[16:]

Out[23]:

'fox jumped'

In [24]:

s.find('booyah')

Out[24]:

-1

In [25]:

s.startswith('The')

Out[25]:

True

In [26]:

s.endswith('jumped')

Out[26]:

True

In [27]:

s.endswith('fox')

Out[27]:

False
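
Two related tools are worth knowing alongside `find()`: the `in` operator, which only answers whether the substring is present, and `index()`, which behaves like `find()` but raises `ValueError` instead of returning -1. A minimal sketch using the same sentence:

```python
s = "The quick brown fox jumped"

# `in` gives a plain True/False answer.
has_fox = "fox" in s
print(has_fox)            # True

# find() returns the starting position, or -1 when there is no match.
print(s.find("fox"))      # 16
print(s.find("booyah"))   # -1

# index() raises an exception for a missing substring.
try:
    s.index("booyah")
    found = True
except ValueError:
    found = False
print(found)              # False
```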

Checking a string's contents¶

`isdigit()`, `isalpha()`, `islower()`, `isupper()`, `isspace()`, etc.¶

In [28]:

'1234'.isdigit()

Out[28]:

True

In [29]:

'123.45'.isdigit()

Out[29]:

False

In [30]:

'ABC'.isalpha()

Out[30]:

True

In [31]:

'ABC123'.isalpha()

Out[31]:

False

In [32]:

"ABC123".isalnum()

Out[32]:

True

In [33]:

'ABC easy as 123'.isalnum()

Out[33]:

False

In [34]:

'hello'.islower()

Out[34]:

True

In [35]:

'HELLO'.isupper()

Out[35]:

True

In [36]:

'Hello'.istitle()

Out[36]:

True

In [37]:

'   '.isspace()

Out[37]:

True
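
As the `'123.45'.isdigit()` cell shows, these checks are not enough to decide whether a string holds a number. One common approach, sketched here with a hypothetical helper named `is_float`, is simply to attempt the conversion and catch the failure:

```python
def is_float(text):
    """Return True if float() can parse the text (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True

print('123.45'.isdigit())   # False: the '.' is not a digit
print(is_float('123.45'))   # True
print(is_float('-42'))      # True
print(is_float('1e-3'))     # True
print(is_float('ABC'))      # False
```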

String Formatting¶

The old way¶

The old-style string formatting operations will look familiar to those who have used printf in C. Each % in the string introduces a format specifier, which is replaced by the corresponding value.

The basic interface is

"%<format>" % value

In [38]:

from math import pi
"my favorite integer is %d, but my favorite float is %f." % (42, pi)

Out[38]:

'my favorite integer is 42, but my favorite float is 3.141593.'

In [39]:

"in exponential notation it's %e" % pi

Out[39]:

"in exponential notation it's 3.141593e+00"

In [40]:

"to choose smartly if exponential is needed: %g" % pi

Out[40]:

'to choose smartly if exponential is needed: 3.14159'

In [41]:

"or with a bigger number: %g" % 123456787654321.0

Out[41]:

'or with a bigger number: 1.23457e+14'

In [42]:

"rounded to three decimal places it's %.3f" % pi

Out[42]:

"rounded to three decimal places it's 3.142"

In [43]:

"an integer padded with spaces: %10d" % 42

Out[43]:

'an integer padded with spaces:         42'

In [44]:

"an integer padded on the right: %-10d" % 42

Out[44]:

'an integer padded on the right: 42        '

In [45]:

"an integer padded with zeros: %010d" % 42

Out[45]:

'an integer padded with zeros: 0000000042'

In [46]:

"we can also name our arguments: %(value)d" % dict(value=3)

Out[46]:

'we can also name our arguments: 3'

In [47]:

"Escape the percent sign with an extra symbol: the %d%%" % 99

Out[47]:

'Escape the percent sign with an extra symbol: the 99%'
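
Old-style formats also accept `*` in place of a width or precision, in which case the number is taken from the argument tuple. The sketch below uses arbitrary values to show the idea:

```python
# Width taken from the arguments: same as "%6d" % 42
a = "%*d" % (6, 42)
print(repr(a))   # '    42'

# Precision taken from the arguments: same as "%.2f" % 3.14159
b = "%.*f" % (2, 3.14159)
print(repr(b))   # '3.14'

# Both at once, with zero padding: same as "%010.4g" % 3.14159
c = "%0*.*g" % (10, 4, 3.14159)
print(repr(c))   # '000003.142'
```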

Read more about format specifiers in the Python docs.

Formatting: the new way¶

New-style string formatting uses curly braces {} to mark the replacement fields, which can refer to arguments by position or by name:

"{0} {name}".format(first, name=second)

In [48]:

"{}{}".format("ABC", 123)

Out[48]:

'ABC123'

In [49]:

"{0}{1}".format("ABC", 123)

Out[49]:

'ABC123'

In [50]:

"{0}{0}".format("ABC", 123)

Out[50]:

'ABCABC'

In [51]:

"{1}{0}".format("ABC", 123)

Out[51]:

'123ABC'

The format specification comes after the colon (:):

In [52]:

("%.2f" % 3.14159) ==  "{:.2f}".format(3.14159)

Out[52]:

True

In [53]:

"{0:d} is an integer; {1:.3f} is a float".format(42, pi)

Out[53]:

'42 is an integer; 3.142 is a float'

In [54]:

"{the_answer:010d} is an integer; {pi:.5g} is a float".format(the_answer=42,
                                                              pi=pi)

Out[54]:

'0000000042 is an integer; 3.1416 is a float'

In [55]:

'{desire} to {place}'.format(desire='Fly me',
                             place='The Moon')

Out[55]:

'Fly me to The Moon'

In [56]:

# using a pre-defined dictionary
f = {"desire": "Won't you take me",
     "place": "funky town?"}

'{desire} to {place}'.format(**f)

Out[56]:

"Won't you take me to funky town?"

In [57]:

# format also supports binary numbers
"int: {0:d};  hex: {0:x};  oct: {0:o};  bin: {0:b}".format(42)

Out[57]:

'int: 42;  hex: 2a;  oct: 52;  bin: 101010'
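
A few more format specifications are handy for lining up columns. The sketch below shows the alignment flags `<`, `^`, and `>` and the thousands separator `,`. It also shows an f-string, which accepts the same specifications but needs Python 3.6 or later, so it may not run in the environment this notebook was written for.

```python
from math import pi

# Left-align, center, and right-align within a field of width 7.
row = "{:<7}|{:^7}|{:>7}".format("l", "c", "r")
print(repr(row))     # 'l      |   c   |      r'

# A comma in the specification adds thousands separators.
big = "{:,}".format(1234567)
print(big)           # 1,234,567

# f-strings (Python 3.6+) take the same specification after the colon.
short = f"{pi:.3f}"
print(short)         # 3.142
```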

File I/O¶

Let's create a file for us to read:

In [58]:

%%file inout.dat
Here is a nice file
with a couple lines of text
it is a haiku

Writing inout.dat

In [60]:

f = open('inout.dat')
print(f.read())
f.close()

Here is a nice file
with a couple lines of text
it is a haiku

In [62]:

f = open('inout.dat')
print(f.readlines())
f.close()

['Here is a nice file\n', 'with a couple lines of text\n', 'it is a haiku']

In [64]:

for line in open('inout.dat'):
    print(line.split())

['Here', 'is', 'a', 'nice', 'file']
['with', 'a', 'couple', 'lines', 'of', 'text']
['it', 'is', 'a', 'haiku']

In [65]:

# write() is the opposite of read()
contents = open('inout.dat').read()
out = open('my_output.dat', 'w')
out.write(contents.replace(' ', '_'))
out.close()

In [66]:

!cat my_output.dat

Here_is_a_nice_file
with_a_couple_lines_of_text
it_is_a_haiku

In [67]:

# writelines() is the opposite of readlines()
lines = open('inout.dat').readlines()
out = open('my_output.dat', 'w')
out.writelines(lines)
out.close()

In [68]:

!cat my_output.dat

Here is a nice file
with a couple lines of text
it is a haiku
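
The cells above either call `close()` by hand or leave the file object to be cleaned up later. A common alternative is the `with` statement, which closes the file automatically when the block ends, even if an error occurs inside it. This is a minimal sketch; the file name `example.dat` and the temporary directory are assumptions made so the example does not overwrite anything.

```python
import os
import tempfile

# Write to a scratch location so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), "example.dat")

# The file is closed as soon as the with-block is left.
with open(path, "w") as out:
    out.write("line one\nline two\n")

# Iterating over the file object yields one line at a time.
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

print(lines)      # ['line one', 'line two']
print(f.closed)   # True
```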

Breakout: cleaning up some output¶

Here is some code that creates a comma-delimited file of numbers with random precision, a random number of leading spaces, and inconsistent zero-padding:

In [69]:

# Don't modify this: it simply writes the example file
f = open('messy_data.dat', 'w')
import random
for i in range(100):
    for j in range(5):
        f.write(' ' * random.randint(0, 6))
        f.write('%0*.*g' % (random.randint(8, 12),
                            random.randint(5, 10),
                            100 * random.random()))
        if j != 4:
            f.write(',')
    f.write('\n')
f.close()

In [70]:

# Look at the first four lines of the file:
!head -4 messy_data.dat

     069.40604687,  0094.5912, 96.79042884, 0000055.655,  0023.7310709
 10.52260323,    000032.757,00033.982631,      0090.194719,      43.57646106
      040.527913,      00065.72179,  000086.8327,00011.0367,99.36526435
     00000074.411,      3.816226122,   00047.43759,   000079.62696,  040.8001

Your task: Write a program that reads in the contents of "messy_data.dat" and extracts the numbers from each line, using the string manipulations we used above (remember that float() will convert a suitable string to a floating-point number).

Next write out a new file named "clean_data.dat". The new file should contain the same data as the old file, but with uniform formatting and aligned columns.
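
As a hint, here is one possible starting point that handles a single line; it is a sketch rather than the intended solution, and the field width of 10 with 4 decimal places is an arbitrary choice. Extending it to read every line of the file and write the results is the exercise.

```python
# One line copied from the sample output above.
line = "     069.40604687,  0094.5912, 96.79042884, 0000055.655,  0023.7310709\n"

# float() ignores surrounding whitespace and leading zeros.
values = [float(field) for field in line.split(",")]
print(values)

# Format every value with the same width and precision.
clean = ", ".join("{:10.4f}".format(v) for v in values)
print(clean)
```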

In [71]:

# your solution here

The numpy solution¶

What you did above with text wrangling, numpy can do much more easily:

In [72]:

import numpy as np
data = np.loadtxt("messy_data.dat", delimiter=',')
np.savetxt("clean_data.dat", data,
           delimiter=',', fmt="%8.4f")

In [73]:

!head -5 clean_data.dat

 69.4060, 94.5912, 96.7904, 55.6550, 23.7311
 10.5226, 32.7570, 33.9826, 90.1947, 43.5765
 40.5279, 65.7218, 86.8327, 11.0367, 99.3653
 74.4110,  3.8162, 47.4376, 79.6270, 40.8001
 77.2510, 79.3929, 36.7943, 71.0619, 74.8516

Still, text manipulation is a very good skill to have under your belt!