How to Read Part of a File in Python 3.7

How to extract specific portions of a text file using Python

Updated: 06/30/2020 by Computer Hope

Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.

Make sure you're using Python 3

In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.

For Microsoft Windows, Python 3 can be downloaded from the official Python website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked, as shown in the image below.

Installing Python 3.7.2 for Windows.

On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:

sudo apt-get update && sudo apt-get install python3

For macOS, the Python 3 installer can be downloaded from python.org, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:

brew install python3

Running Python

On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.

Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().

Running Python with a file name will interpret that Python program. For example:

python3 program.py

...runs the program contained in the file program.py.

Okay, how can we use Python to extract text from a text file?

Reading data from a text file

First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Note

In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.

A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.

myfile = open("lorem.txt", "rt") # open lorem.txt for reading text
contents = myfile.read()         # read the entire file to string
myfile.close()                   # close the file
print(contents)                  # print string contents

Here, myfile is the name we give to our file object.

The "rt" parameter in the open() function means "we're opening this file to read text data."

The hash mark ("#") means that everything that follows it on that line is a comment, and it's ignored by the Python interpreter.

If you save this program in a file called read.py, you can run it with the following command.

python3 read.py

The command above outputs the contents of lorem.txt:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Using "with open"

It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.

When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.

Using with open...as, we can rewrite our program to look like this:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string
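As a minimal sketch of the automatic close, the snippet below checks the file object's closed attribute inside and after the with block. To stay self-contained, it first writes the four sample lines to a temporary directory; the names sample_text and sample_path are illustrative and not part of the original program.

```python
import os
import tempfile

# Illustrative setup: write the four sample lines to a temporary lorem.txt.
sample_text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "Nunc fringilla arcu congue metus aliquam mollis.\n"
    "Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n"
    "Quisque at dignissim lacus.\n"
)
sample_path = os.path.join(tempfile.mkdtemp(), "lorem.txt")
with open(sample_path, "wt") as outfile:
    outfile.write(sample_text)

with open(sample_path, "rt") as myfile:
    contents = myfile.read()
    open_inside = not myfile.closed   # the file is still open inside the block

closed_after = myfile.closed          # the file was closed on leaving the block
print(open_inside, closed_after)
```

Both values should be True: the file is usable inside the block and closed as soon as the block ends, without an explicit close() call.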

Note

Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.

Example

Save the program as read.py and execute it:

python3 read.py

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

Reading text files line-by-line

In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.

In almost every case, it's a better idea to read a text file one line at a time.

In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
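To see the "next" behavior directly, here is a small sketch that calls the built-in next() function on a file-like object. It uses io.StringIO as an in-memory stand-in for a text file (an assumption made so the snippet runs without a file on disk); a file object returned by open() can be stepped through the same way.

```python
import io

# In-memory stand-in for a text file (illustrative only).
fake_file = io.StringIO("first line\nsecond line\nthird line\n")

first = next(fake_file)        # the first line, including its newline
second = next(fake_file)       # the second line
remaining = list(fake_file)    # whatever the iterator has not yet produced

print(repr(first), repr(second), remaining)
```

Each call to next() picks up where the previous one stopped, which is what a for loop does behind the scenes.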

Example

For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:

with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a linebreak of its own at the end of whatever you've asked it to print.

Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.

Storing text data in a variable

In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.

Example

mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.

The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:

Output:

['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']

Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.

Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.

Note

If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.

Example

We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:

print(mylines[0])

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Example

Or the third line, by specifying index number 2:

print(mylines[2])

Output:

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the last valid index is 3:

Example

print(mylines[4])

Output:

Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range

Example

A list object is iterable, so to print every element of the list, we can iterate over it with for...in:

mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open lorem.txt for reading text.
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
    for element in mylines:               # For each element in the list,
        print(element)                    # print it.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.

We can change this default behavior by specifying an end parameter in our print() call:

print(element, end='')

By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.

Example

Our revised program looks like this:

mylines = []                              # Declare an empty list
with open('lorem.txt', 'rt') as myfile:   # Open file lorem.txt
    for line in myfile:                   # For each line of text,
        mylines.append(line)              # add that line to the list.
    for element in mylines:               # For each element in the list,
        print(element, end='')            # print it without extra newlines.

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.

The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.

How to strip newlines

To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.

Tip

This process is sometimes also called "trimming."

Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.

If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
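One detail worth illustrating: the chars argument is treated as a set of characters to remove, not as a suffix. The short sketch below uses made-up strings (the variable names a through d are only for illustration) to show this, along with the no-argument form that strips trailing whitespace.

```python
a = "123abc".rstrip("bc")        # trailing 'c' and 'b' are removed
b = "123abcbcb".rstrip("bc")     # every trailing 'b' or 'c' is removed, in any order
c = "line of text\n".rstrip("\n")  # only the trailing newline is removed
d = "  padded \t\n".rstrip()     # no argument: strip all trailing whitespace

print(a, b, repr(c), repr(d))
```

Stripping stops at the first character from the right that is not in chars, which is why the 'a' survives in both of the first two cases.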

Tip

When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted, enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.

The statement string.rstrip('\n') will strip a newline character from the right side of string. The following version of our program strips the newlines when each line is read from the text file:

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.
for element in mylines:                     # For each element in the list,
    print(element)                          # print it.

The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
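Putting the newlines back might look like the following sketch. It assumes the stripped lines are already in a list and writes them to a file named copy.txt in a temporary directory; the file name, out_path, and roundtrip are illustrative choices, not part of the original programs.

```python
import os
import tempfile

# Stripped lines, as produced by rstrip('\n') in the program above.
mylines = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Nunc fringilla arcu congue metus aliquam mollis.",
]

out_path = os.path.join(tempfile.mkdtemp(), "copy.txt")  # illustrative location
with open(out_path, "wt") as outfile:
    for line in mylines:
        outfile.write(line + "\n")    # restore the newline that was stripped

with open(out_path, "rt") as infile:  # read the file back to check the result
    roundtrip = infile.read()
print(roundtrip, end="")
```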

Now, let's search the lines in the list for a specific substring.

Searching text for a substring

Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.

The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.

Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.

In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.

Example

print(mylines[0].find("e"))

Output:

3

The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)

The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (indices 10 through 19; the stop index itself is not included). If stop is not specified, find() starts at index start, and stops at the end of the string.

Example

For instance, the following statement searches for "e" in mylines[0], beginning at the fifth character.

print(mylines[0].find("e", 4))

Output:

24

In other words, starting at the 5th character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").

Example

To start searching at index 10, and stop at index 30:

print(mylines[1].find("e", 10, 30))

Output:

25

(The "e" in "congue").

If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:

print(mylines[0].find("e", 25, 30))

Output:

-1

There were no "e" occurrences between indices 25 and 30.
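The start and stop arguments follow the same rules as slice notation, which the sketch below tries to make concrete. It hard-codes the first line of lorem.txt so it runs without the file; the variable names are illustrative.

```python
line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

whole = line.find("e")             # search the whole string
from_4 = line.find("e", 4)         # start at index 4, run to the end
excl = line.find("e", 4, 24)       # stop index 24 is excluded from the search
incl = line.find("e", 4, 25)       # index 24 is now inside the range
same = line[4:25].find("e") + 4    # same search on a slice, offset added back

print(whole, from_4, excl, incl, same)
```

This prints 3 24 -1 24 24: the "e" at index 24 is found only when the stop index is at least 25.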

Finding all occurrences of a substring

But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.

In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string: specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.

# Build array of lines from file, strip newlines

mylines = []                                # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:     # Open lorem.txt for reading text.
    for myline in myfile:                   # For each line in the file,
        mylines.append(myline.rstrip('\n')) # strip newline and add to list.

# Locate and print all occurrences of letter "e"

substr = "e"                      # substring to search for.
for line in mylines:              # string to be searched
    index = 0                     # current index: character being compared
    prev = 0                      # previous index: last character compared
    while index < len(line):      # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to first occurrence of "e"
        if index == -1:           # If nothing was found,
            break                 # exit the while loop.
        print(" " * (index - prev) + "e", end='')  # print spaces from previous
                                                   # match, then the substring.
        prev = index + len(substr)  # remember this position for next loop.
        index += len(substr)      # increment the index by the length of substr.
                                  # (Repeat until index > line length)
    print('\n' + line)            # Print the original string under the e's

Output:

   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
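The same search loop can be packaged as a small reusable function. The sketch below is one possible way to do it, not the article's own code: find_all is a hypothetical helper name, and the function yields each index instead of printing markers.

```python
def find_all(text, substr):
    """Yield the index of every non-overlapping occurrence of substr in text."""
    if not substr:                 # an empty substring would never advance
        return
    index = text.find(substr)
    while index != -1:
        yield index
        # Resume the search just past the match that was found.
        index = text.find(substr, index + len(substr))

line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
positions = list(find_all(line, "e"))
print(positions)
```

For the first line of lorem.txt this prints [3, 24, 32, 35, 51], the same positions marked by the e's in the output above.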

Incorporating regular expressions

For complex searches, use regular expressions.

The Python regular expressions module is called re. To use it in your program, import the module earlier you employ it:

import re

The re module implements regular expressions by compiling a search pattern into a blueprint object. Methods of this object can then be used to perform friction match operations.

For example, permit'southward say yous desire to search for any discussion in your certificate which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\west*r\b". What does this mean?

character sequence meaning
\b A give-and-take boundary matches an empty string (anything, including zero at all), but only if it appears before or afterward a non-discussion character. "Word characters" are the digits 0 through ix, the lowercase and capital letter letters, or an underscore ("_").
d Lowercase alphabetic character d.
\w* \w represents any word grapheme, and * is a quantifier significant "zero or more of the previous character." And so \w* will match zero or more word characters.
r Lowercase letter of the alphabet r.
\b Give-and-take purlieus.

So this regular expression will match whatever cord that can be described as "a give-and-take boundary, then a lowercase 'd', and then null or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, bleak, and dr., and the abbreviation dr.

To use this regular expression in Python search operations, we commencement compile information technology into a blueprint object. For example, the following Python statement creates a blueprint object named blueprint which we tin use to perform searches using that regular expression.

pattern = re.compile(r"\bd\due west*r\b")

Note

The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
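A quick way to see the difference is to compare lengths: without the r prefix, "\b" is a single backspace character, while r"\b" is two characters (a backslash and the letter b). The sketch below also runs both versions of the pattern against a made-up sentence; the sample text and variable names are illustrative.

```python
import re

plain_len = len("\b")     # one character: a backspace
raw_len = len(r"\b")      # two characters: a backslash and the letter b

text = "a dinner for the doctor"
raw_hits = re.findall(r"\bd\w*r\b", text)      # \b reaches re as a word boundary
plain_hits = re.findall("\bd\\w*r\b", text)    # "\b" became a backspace here

print(plain_len, raw_len, raw_hits, plain_hits)
```

The raw pattern finds both words, while the non-raw version looks for literal backspace characters around the word and finds nothing.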

Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".

import re
mystring = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")        # compile regex "\bd\w*r\b" to a pattern object
if pat.search(mystring) is not None:  # Search for the pattern. If found,
    print("Found it.")

Output:

Found it.

To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:

import re
mystring = "Hello, Doctor."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(mystring) is not None:
    print("Found it.")

Output:

Found it.
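The match object carries more than a yes-or-no answer. As a brief sketch, its group(), start(), and span() methods report what matched and where; the second sample string below is made up to show the no-match case.

```python
import re

pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)

match = pat.search("Hello, Doctor.")
if match is not None:
    print(match.group())   # the text that matched
    print(match.start())   # index where the match begins
    print(match.span())    # (start, end) pair; end is exclusive

no_match = pat.search("Hello, nurse.")   # no word starts with d and ends with r
print(no_match)
```

Here the match is "Doctor", starting at index 7, and the second search returns None.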

Putting it all together

So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.

Print all lines containing substring

The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.

Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() statement, we construct an output string by joining several strings with the + operator.

errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)

Input (stored in logfile.txt):

This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!

Output:

Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
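For a plain substring test, Python's in operator and the built-in enumerate() function can stand in for find() and the manual line counter. The sketch below is an alternative phrasing, not the program above; it assumes the log lines are already in memory, in a list named log_lines that mirrors the sample logfile.txt.

```python
log_lines = [
    "This is line 1",
    "This is line 2",
    "Line 3 has an error!",
    "This is line 4",
    "Line 5 also has an error!",
]

substr = "error"
errors = []
for linenum, line in enumerate(log_lines, start=1):  # count lines from 1
    if substr in line.lower():                       # case-insensitive test
        errors.append("Line " + str(linenum) + ": " + line)

for err in errors:
    print(err)
```

When reading from a real file, the same loop works with the file object in place of log_lines, with rstrip('\n') applied to each line as before.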

Extract all lines containing substring, using regex

The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.

import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])

Output:

Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can't set config, exiting.

Extract all lines containing a phone number

The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:

  • 123-456-7890
  • (123) 456-7890
  • 123 456 7890
  • 123.456.7890
  • +91 (123) 456-7890

import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line ", str(err[0]), ": " + err[1])

Output:

Line  3 : My phone number is 731.215.8881.
Line  7 : You can reach Mr. Walters at (212) 558-3131.
Line  12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line  14 : She can also be contacted at (888) 312.8403, extension 12.
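To check the pattern without creating info.txt, the sketch below runs it against the notations listed above plus two made-up strings that should not match. The samples list is an assumption for illustration. Because search() looks anywhere in the line, the pattern is satisfied as soon as it finds three digits, an optional separator, and four digits, so it is looser than a full phone number validator.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
    "no digits here",      # made-up non-matching line
    "room 12, floor 3",    # made-up non-matching line
]

matched = [s for s in samples if pattern.search(s) is not None]
print(matched)
```

Only the first five samples end up in matched.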

Search a dictionary for words

The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.

import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')

Output:

Hope
heliotrope
hope
hornpipe
horoscope
hype
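On systems without /usr/share/dict/words, the same pattern can be tried against a short in-memory list. The words below are a hypothetical stand-in for a few lines of that file, each ending in a newline as it would when read from disk.

```python
import re

pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)

# Hypothetical stand-in for a few lines of a dictionary file.
words = ["Hope\n", "happen\n", "heliotrope\n", "help\n", "hype\n", "shape\n", "type\n"]

matches = [w.rstrip("\n") for w in words if pattern.search(w) is not None]
print(matches)
```

The $ anchor still matches here because it accepts a position just before a trailing newline, and "shape" is rejected because its h is not at a word boundary.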

Source: https://www.computerhope.com/issues/ch001721.htm