
multi-processing multi-threading programming


I’m a real beginner at Python, but I have the following script working successfully. It first creates a list of .xml files and then executes the program for each .xml.

Each .xml takes 2-4 minutes to complete and I need to run thousands, so I’ve been trying to speed up my script using multiprocessing, but it appears to be beyond my skills.

Any suggestions on how to modify it would be greatly appreciated.

    # import modules
    import os, sys, shutil, subprocess, fnmatch
    from datetime import datetime, timedelta
    from time import strptime

    # Set variables
    project_folder = r"T:\erin\indivs_sample"
    phoenix_exe_file = r'C:\Phoenix\Phoenix.exe'

    # Create definitions

    def runPhoenix(project_file):
        print "Running Phoenix @: " + str(datetime.now().strftime("%a, %d %b %Y %H:%M:%S GMT")) + " - " + project_file
        process = subprocess.Popen([phoenix_exe_file,project_file])
        process.wait() 
        print "Phoenix Complete @: " + str(datetime.now().strftime("%a, %d %b %Y %H:%M:%S GMT"))

    # Create list of XMLs

    project_files = []

    for file_name in os.listdir(project_folder):
        if fnmatch.fnmatch(file_name,'*.xml'):
            file_path = os.path.join(project_folder, file_name)
            project_files.append(file_path)

    # run project files

    for project_file in project_files:
        runPhoenix(project_file)


    print "completed"
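
For reference, the sequential loop above maps cleanly onto `multiprocessing.Pool`. Below is a minimal sketch (Python 3, whereas the script above is Python 2), reusing the question's `project_folder` and `Phoenix.exe` paths, which exist only on the asker's machine:

```python
import fnmatch
import os
import subprocess
from multiprocessing import Pool

# Paths from the question (they exist only on the asker's machine)
project_folder = r"T:\erin\indivs_sample"
phoenix_exe_file = r"C:\Phoenix\Phoenix.exe"

def list_xml_files(folder):
    # Same matching rule as the original script
    return [os.path.join(folder, name)
            for name in os.listdir(folder)
            if fnmatch.fnmatch(name, "*.xml")]

def run_phoenix(project_file):
    # Worker: launch one Phoenix job and block until it finishes
    subprocess.call([phoenix_exe_file, project_file])
    return project_file

def run_all(project_files, workers=4):
    # Each of the worker processes picks up the next file as it frees up
    with Pool(processes=workers) as pool:
        for finished in pool.imap_unordered(run_phoenix, project_files):
            print("completed:", finished)

# The __main__ guard is required for multiprocessing on Windows
if __name__ == "__main__" and os.path.isdir(project_folder):
    run_all(list_xml_files(project_folder))
```

`imap_unordered` reports each file as it finishes; the worker count should roughly match the number of CPU cores, assuming each Phoenix run keeps one core busy.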

EDIT 1: I have managed to re-write my code a bit more in the format I think multiprocessing needs…

# import modules
import os, sys, shutil, subprocess, fnmatch, time
from datetime import datetime, timedelta
from time import strptime

# Set variables
project_folder = r"C:\TheHillsPilot\Phoenix\GeneralRuns\ProjectXMLs\indivs_sample"
phoenix_exe_file = r'C:\Phoenix\Phoenix.exe'

# Create definitions

#  Definition: print messages when using IDLE
def log(msg):
    print msg

#  Definition: Create list of XMLs
def createlist():
    for file_name in os.listdir(project_folder):
        if fnmatch.fnmatch(file_name,'*.xml'):
            file_path = os.path.join(project_folder, file_name)
            project_files.append(file_path)

#  Definition: Run Phoenix
def runPhoenix(project_file):
    log("Running Phoenix @: " + str(datetime.now().strftime("%a, %d %b %Y %H:%M:%S GMT")) + " - " + project_file)
    process = subprocess.Popen([phoenix_exe_file,project_file])
    process.wait()         
    log("Phoenix Complete @: " + str(datetime.now().strftime("%a, %d %b %Y %H:%M:%S GMT")))

#  Definition: main
def main():
    log("creating list of XMLs")
    createlist()
    for project_file in project_files:
        runPhoenix(project_file)

#  Process: Create empty list
project_files = []

#  Process: Run main
if __name__ == '__main__':
    log("True")
    time_start = time.clock()
    main()
    time_end = time.clock()
    log("Time taken in main in seconds(s) is : {}".format(str(time_end - time_start)))

log("completed")


— Part 1 Understanding the problem —

Your question looks a bit complicated. Let me see if I understand your Python program correctly. Your program does two main things:

  1. Look into a project folder, find the xml files that match some criteria, and create a list of the file names of the matched xml files.
  2. Use the Phoenix.exe program to process, or possibly convert, each of the xml files in the old list into a new list of “phoenix” files.

I know very little about html and xml, and have never heard of the Phoenix program for xml files.

But I think your problem, in general, is trying to speed up a list of time-consuming jobs by executing them in parallel.

Let me give a specific example of your general problem. You have, say, 1,000 text files in English, and you want to translate the English text files into Spanish. For now, you have only one translator doing the job sequentially and it takes a very long time.

So you would like to get, say, 4 translators to do the jobs in parallel.

One possible solution is to use the Python multiprocessing package, which can create a pool of, say, 4 translator worker processes doing the jobs at the same time. This way, you can be up to 4 times faster.

If you think I understand your problem correctly, I can suggest a rough Python multiprocessing program example to do some simple text processing for your reference.

— Part 2 Create / Print text files —

# *** Python 3.6.5 Text file create/print functions - tlfong01 2018apr18hkt1521 ***

textFileNameLinelistDict = \
        {
            '0' : {'TextFileName': 'textFile1.txt', 'Linelist' :['line11\n', 'line12\n']},
            '1' : {'TextFileName': 'textFile2.txt', 'Linelist' :['line21\n', 'line22\n']},
        }

def createManyTextFiles(textFileNameLinelistDict):
    for textFileNum in range(len(textFileNameLinelistDict)):
        textFileName = textFileNameLinelistDict[str(textFileNum)]['TextFileName']
        lineList     = textFileNameLinelistDict[str(textFileNum)]['Linelist']       
        with open(textFileName, mode = 'w', encoding = 'utf8') as textFile:
            for line in lineList:
                textFile.write(line)
    return

def printManyTextFiles(textFileNameLinelistDict):
    for textFileNum in range(len(textFileNameLinelistDict)):
        textFileName = textFileNameLinelistDict[str(textFileNum)]['TextFileName']
        print(textFileName)
        with open(textFileName, mode = 'r', encoding = 'utf8') as textFile:
            print(textFile.read())                
    return

def createTwoTextFiles():
    createManyTextFiles(textFileNameLinelistDict)
    return

def printTwoTextFiles():
    printManyTextFiles(textFileNameLinelistDict)
    return

# *** Main ***

def main():
    createTwoTextFiles()
    printTwoTextFiles()
    return    

if __name__ == '__main__':        
    main()

# *** Sample output ***
textFile1.txt
line11
line12

textFile2.txt
line21
line22    

— Part 3 Translating text files —

Now I have written a ‘translation’ function which inputs a text file and output a new text file with all the lower case letters shifted to upper case.

# *** Python 3.6.5 Text file translate - tlfong01 2018apr18hk1940 ***

textFileDict = \
    {
        '0' : {'InputTextFileName'  : 'inputTextFile1.txt',
               'Linelist'           : ['line11z\n', 'line12z\n'],
               'OutputTextFileName' : 'outputTextFile1.txt',
              },
        '1' : {'InputTextFileName'  : 'inputTextFile2.txt',
               'Linelist'           : ['line21z\n', 'line22z\n'],
               'OutputTextFileName' : 'outputTextFile2.txt',
              }
    }

def createManyTextFiles(textFileDict):
    for textFileNum in range(len(textFileDict)):
        textFileName = textFileDict[str(textFileNum)]['InputTextFileName']
        lineList     = textFileDict[str(textFileNum)]['Linelist']       
        with open(textFileName, mode = 'w', encoding = 'utf8') as textFile:
            for line in lineList:
                textFile.write(line)
    return

def printManyTextFiles(textFileDict, fileNameType):
    for textFileNum in range(len(textFileDict)):
        textFileName = textFileDict[str(textFileNum)][fileNameType]
        print(textFileName)
        with open(textFileName, mode = 'r', encoding = 'utf8') as textFile:
            print(textFile.read())                
    return

def translateManyTextFiles(textFileDict, translateFunction):
    for textFileNum in range(len(textFileDict)):
        inputTextFileName  = textFileDict[str(textFileNum)]['InputTextFileName']
        outputTextFileName = textFileDict[str(textFileNum)]['OutputTextFileName']
        inputTextFile  = open(inputTextFileName,  mode = 'r', encoding = 'utf8') 
        outputTextFile = open(outputTextFileName, mode = 'w', encoding = 'utf8')
        for line in inputTextFile:
            outputTextFile.write(translateFunction(line))
        inputTextFile.close()
        outputTextFile.close()
    return

def shiftUpperCase(string):
    return string.upper()

def createTwoTextFiles():
    createManyTextFiles(textFileDict)
    return

def printTwoInputTextFiles():
    printManyTextFiles(textFileDict, 'InputTextFileName')
    return

def translateTwoTextFiles():
    translateManyTextFiles(textFileDict, shiftUpperCase)
    return

def printTwoOutputTextFiles():
    printManyTextFiles(textFileDict, 'OutputTextFileName')
    return

# *** Main ***

def main():
    createTwoTextFiles()
    printTwoInputTextFiles()
    translateTwoTextFiles()
    printTwoOutputTextFiles()    
    return    

if __name__ == '__main__':        
    main()

# *** Sample output ***

'''
inputTextFile1.txt
line11z
line12z

inputTextFile2.txt
line21z
line22z

outputTextFile1.txt
LINE11Z
LINE12Z

outputTextFile2.txt
LINE21Z
LINE22Z
'''

— Part 4 Multiprocessing using Pool —

I have written more tests for multiprocessing. So far so good.

# *** Python 3.6.5 multiprocessing translate - tlfong01 2018apr18hk2153 ***

import multiprocessing as mp

# *** Text file dictionary ***

textFileDict = \
    {
        '0' : {'InputTextFileName'  : 'inputTextFile1.txt',
               'Linelist'           : ['line11x\n', 'line12x\n'],
               'OutputTextFileName' : 'outputTextFile1.txt',
              },
        '1' : {'InputTextFileName'  : 'inputTextFile2.txt',
               'Linelist'           : ['line21y\n', 'line22y\n'],
               'OutputTextFileName' : 'outputTextFile2.txt',
              }
    }

# *** Create text file ***

def createManyTextFiles(textFileDict):
    for textFileNum in range(len(textFileDict)):
        textFileName = textFileDict[str(textFileNum)]['InputTextFileName']
        lineList     = textFileDict[str(textFileNum)]['Linelist']       
        with open(textFileName, mode = 'w', encoding = 'utf8') as textFile:
            for line in lineList:
                textFile.write(line)
    return

# *** Print text files ***

def printOneTextFile(textFileName):
    print(textFileName)
    with open(textFileName, mode = 'r', encoding = 'utf8') as textFile:
        print(textFile.read())
    return    

def printManyTextFiles(textFileDict, fileNameType):
    for textFileNum in range(len(textFileDict)):
        textFileName = textFileDict[str(textFileNum)][fileNameType]
        printOneTextFile(textFileName)              
    return

# *** Sequential translate text files ***

def shiftUpperCase(string):
    return string.upper()

def translateOneTextFile(inputTextFileName, outputTextFileName, translateFunction):
    inputTextFile  = open(inputTextFileName,  mode = 'r', encoding = 'utf8') 
    outputTextFile = open(outputTextFileName, mode = 'w', encoding = 'utf8')
    for line in inputTextFile:
        outputTextFile.write(translateFunction(line))
    inputTextFile.close()
    outputTextFile.close()
    return

def sequentialTranslateManyTextFiles(textFileDict, translateFunction):
    for textFileNum in range(len(textFileDict)):
        inputTextFileName  = textFileDict[str(textFileNum)]['InputTextFileName']
        outputTextFileName = textFileDict[str(textFileNum)]['OutputTextFileName']
        translateOneTextFile(inputTextFileName, outputTextFileName, translateFunction)
    return

def shiftUpperCaseFileNameList(fileNameList):
    translateOneTextFile(fileNameList[0], fileNameList[1], shiftUpperCase)
    return

# *** Test functions ***

def createTwoTextFiles():
    createManyTextFiles(textFileDict)
    return

def printTwoInputTextFiles():
    printManyTextFiles(textFileDict, 'InputTextFileName')
    return

def sequentialTranslateTwoTextFiles():
    sequentialTranslateManyTextFiles(textFileDict, shiftUpperCase)
    return

def printTwoOutputTextFiles():
    printManyTextFiles(textFileDict, 'OutputTextFileName')
    return

def sequentialShiftUpperCaseManyTextFiles():
    inputTextFile1  = 'inputTextFile1.txt' 
    inputTextFile2  = 'inputTextFile2.txt'
    outputTextFile1 = 'outputTetFile1.txt'     
    outputTextFile2 = 'outputTetFile2.txt'    

    fileNameFunctionList0 = [inputTextFile1, outputTextFile1, shiftUpperCase]
    fileNameFunctionList1 = [inputTextFile2, outputTextFile2, shiftUpperCase]    

    shiftUpperCaseFileNameList(fileNameFunctionList0)
    shiftUpperCaseFileNameList(fileNameFunctionList1)  

    printOneTextFile(inputTextFile1)
    printOneTextFile(outputTextFile1)
    printOneTextFile(inputTextFile2)
    printOneTextFile(outputTextFile2)

    return


def parallelShiftUpperCaseManyTextFiles():

    # *** input output file name configuration ***

    inputTextFileName1  = 'inputTextFile1.txt'
    outputTextFileName1 = 'outputTextFile1.txt'
    inputTextFileName2  = 'inputTextFile2.txt'
    outputTextFileName2 = 'outputTextFile3.txt'

    # *** parallel translating the input files ***

    pool = mp.Pool(4)
    pool.map(shiftUpperCaseFileNameList, \
                [
                    [inputTextFileName1, outputTextFileName1],
                    [inputTextFileName2, outputTextFileName2],
                ]
            )
    pool.close()
    pool.join()

    # *** print input out files ***

    printOneTextFile(inputTextFileName1)
    printOneTextFile(outputTextFileName1)
    printOneTextFile(inputTextFileName2)
    printOneTextFile(outputTextFileName2)     

    return

# *** Main ***

def main():

    # *** Sequential translation using text file dictionary ***

    createTwoTextFiles()
    printTwoInputTextFiles()
    sequentialTranslateTwoTextFiles()
    printTwoOutputTextFiles()

    # *** Sequential shift upper case using text file name lists ***

    sequentialShiftUpperCaseManyTextFiles()

    # *** Parallel shift upper case using text file name lists ***

    parallelShiftUpperCaseManyTextFiles()

    return    

if __name__ == '__main__':        
    main()

# *** Sample output ***

inputTextFile1.txt
line11x
line12x

inputTextFile2.txt
line21y
line22y

outputTextFile1.txt
LINE11X
LINE12X

outputTextFile2.txt
LINE21Y
LINE22Y

inputTextFile1.txt
line11x
line12x

outputTetFile1.txt
LINE11X
LINE12X

inputTextFile2.txt
line21y
line22y

outputTetFile2.txt
LINE21Y
LINE22Y

inputTextFile1.txt
line11x
line12x

outputTextFile1.txt
LINE11X
LINE12X

inputTextFile2.txt
line21y
line22y

outputTextFile3.txt
LINE21Y
LINE22Y

— Part 5 Multiprocessing using Apply Async —

Last time I used the “Pool.map” method to do multiprocessing. This time I am using the “apply_async” method. I have added the apply_async segment to the original function, to make it easier to compare and contrast.
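
For readers comparing the two approaches, the behavioral difference shows up even with a toy worker (this sketch is mine, not part of the answer's file-translation code):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(2) as pool:
        # map() blocks until every job is done, returning results in input order
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]

        # apply_async() returns immediately with an AsyncResult handle;
        # .get() waits for and fetches each individual result
        handles = [pool.apply_async(square, (x,)) for x in (1, 2, 3)]
        print([h.get() for h in handles])  # [1, 4, 9]
```

With apply_async the jobs are only guaranteed finished after close() and join() on the pool, or get() on each handle, which is why the code below does exactly that.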


# *** Python 3.6.5 multiprocessing translate - tlfong01 2018apr20hk1549 ***

import multiprocessing as mp

def parallelPoolAndApplyAsyncShiftUpperCaseManyTextFiles():

    # *** input output file name configuration ***

    inputTextFileName1  = 'inputTextFile1.txt'
    inputTextFileName2  = 'inputTextFile2.txt'

    outputTextFileName1 = 'outputTextFile8.txt'
    outputTextFileName2 = 'outputTextFile9.txt'

    # *** Using pool to translate the text files ***

    pool = mp.Pool(4)
    pool.map(shiftUpperCaseFileNameList, \
                [
                    [inputTextFileName1, outputTextFileName1],
                    [inputTextFileName2, outputTextFileName2],
                ]
            )
    pool.close()
    pool.join()

    print('\n*** Using pool to translate the text files ***\n')
    printOneTextFile(inputTextFileName1)
    printOneTextFile(outputTextFileName1)
    printOneTextFile(inputTextFileName2)
    printOneTextFile(outputTextFileName2)     

    # *** Using Apply Async translate the text files ***

    pool = mp.Pool(2)        
    pool.apply_async(shiftUpperCaseFileNameList, ([inputTextFileName1, outputTextFileName1],))
    pool.apply_async(shiftUpperCaseFileNameList, ([inputTextFileName2, outputTextFileName2],))
    pool.close()
    pool.join()    

    # *** print input out files ***

    print('\n*** Using Apply Async translate the text files ***\n')
    printOneTextFile(inputTextFileName1)
    printOneTextFile(outputTextFileName1)
    printOneTextFile(inputTextFileName2)
    printOneTextFile(outputTextFileName2)     

    return

# *** Main ***

def test0():
    createTwoTextFiles()
    #printTwoInputTextFiles()
    #sequentialTranslateTwoTextFiles()
    #printTwoOutputTextFiles()
    #sequentialShiftUpperCaseManyTextFiles()
    parallelPoolAndApplyAsyncShiftUpperCaseManyTextFiles()    
    return

def main():
    test0()      
    return    

if __name__ == '__main__':        
    main()

# *** End ***    

# *** Sample output ***

'''
>>> 
 RESTART: D:\work\rpi3b\programs\parallel_python\programs\mtp01_2018apr1905.py 

*** Using pool to translate the text files ***

inputTextFile1.txt
line11x
line12x

outputTextFile8.txt
LINE11X
LINE12X

inputTextFile2.txt
line21y
line22y

outputTextFile9.txt
LINE21Y
LINE22Y


*** Using Apply Async translate the text files ***

inputTextFile1.txt
line11x
line12x

outputTextFile8.txt
LINE11X
LINE12X

inputTextFile2.txt
line21y
line22y

outputTextFile9.txt
LINE21Y
LINE22Y

>>> 

''' 

— Part 6 —

To be continued…

  • Yes please! You’ve got the idea of it – the program “Phoenix.exe” opens each xml in turn and spits out some result files. So I’m looking for a script that will do this in parallel rather than sequentially… just as you’ve suggested. I’ve tried the multiprocessing package myself but it is beyond my ability to make it work. – ErinH Apr 18 ’18 at 3:09
  • Good. So I shall move on to do some testing work. I will write some functions to create and print test files, and then a sort of translator to translate the text files. To start with, the translator would be one input file to one output file. If things go well, it wouldn’t be hard to change the translator to one file in, many files out. – tlfong01 Apr 18 ’18 at 8:10   
  • Now I have tried both the Pool.map and the apply_async methods to do multiprocessing. It is a bit hard for a python beginner to digest it all. So I will take a break and let you take your time to slowly go through the messy functions, and perhaps later add your sequential and parallel timing comparison code to check whether parallel execution in this case indeed saves some time. Happy python programming! – tlfong01 Apr 20 ’18 at 8:09
  • Thanks for your response. You lost me with the translation bit… I’ll have to come back and look at it again soon. Cheers – ErinH May 2 ’18 at 3:31
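
Following the timing suggestion in the comments above, here is a rough sequential-versus-parallel comparison using a sleep as a stand-in for the 2-4 minute Phoenix run (a sketch; the 0.2 s delay and job count are arbitrary):

```python
import time
from multiprocessing import Pool

def slow_job(x):
    time.sleep(0.2)  # stand-in for one 2-4 minute Phoenix run
    return x

if __name__ == "__main__":
    jobs = list(range(8))

    t0 = time.perf_counter()
    for x in jobs:
        slow_job(x)
    sequential = time.perf_counter() - t0

    t0 = time.perf_counter()
    with Pool(4) as pool:
        pool.map(slow_job, jobs)
    parallel = time.perf_counter() - t0

    # With 4 workers, expect roughly a 4x speed-up for 8 equal jobs,
    # minus the one-off cost of starting the worker processes
    print("sequential: %.2fs, parallel: %.2fs" % (sequential, parallel))
```

Note that this toy job only sleeps; for CPU-bound work like the Phoenix runs, the speed-up is limited by the number of physical cores.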
