Data from text file to dataframe

Solutions:


Here you go! It’s really fast.

For a file with ~11 million lines (made by copying and pasting your sample file over and over again), it took about 22 seconds on my machine, and produced a dataframe with 2.2 million rows.

Note: I wasn’t sure how to handle the Program column. None of the values in your expected dataframe ends with _AV, but the ones in your text file do, and I don’t know what your rule for that is. The code below keeps the value as it appears in the file.
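If the rule turns out to be "drop a trailing _AV", a minimal sketch could look like the following. The helper name `strip_suffix` and the sample values are made up for illustration, and the rule itself is only a guess at what the expected dataframe implies.

```python
def strip_suffix(program, suffix="_AV"):
    """Return `program` without a trailing `suffix`, if it has one.

    Assumption: the suffix should only be removed when it is at the very end.
    """
    if suffix and program.endswith(suffix):
        return program[:-len(suffix)]
    return program


# Hypothetical sample values, not taken from the original file
print(strip_suffix("JOB1_PRG_AV"))
print(strip_suffix("JOB1_PRG"))
```

In the loop below, this would be applied where `last_program` is assigned, e.g. `last_program = strip_suffix(job)`.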

import pandas as pd
import json
import re

from numpy import nan

file = "test.txt"

with open(file) as f:
    lines = f.readlines()

# To store the final data before feeding it to the dataframe
dct = {
    'Job': [],
    'Program': [],
    'Type': [],
    'Stoolname': [],
    'Times': [],
    'Min/Max': [],
    'Stool': [],
    'Number': [],
}

# To keep track of missing values
counts = {}

field_re = re.compile(r'^[a-z/]+:', re.IGNORECASE)
type_change_re = re.compile(r'^[\d.: -]+\w+$', re.IGNORECASE)

# This will keep a list of names of keys that we've encountered since this item
# started. We need this because there is no delimiter between objects in the text file.
# (Using a dict like a list here, because dicts are much faster to search
# (their keys) than lists)
hit_fields = {}

# Use a dict like a list here (see above)
special_fields = {
    'Job': None,
    'Program': None,
    'Type': None,
}

last_type = ''
last_job = ''
last_program = ''

for line in lines:
    line = line.strip().strip(';')
    if field_re.search(line) is not None:
        # Split on the first ': ' only, so values that contain ': ' stay intact
        k, v = line.split(': ', 1)
        if k in hit_fields:

            # We've found a new item. Add all the accumulated fields to dct
            for field in hit_fields:
                dct[field].append(hit_fields[field])
            for field in dct:
                if field not in hit_fields and field not in special_fields:
                    dct[field].append(nan)
            hit_fields = {}

            dct['Job'].append(last_job)
            dct['Program'].append(last_program)
            dct['Type'].append(last_type)

        hit_fields[k] = v

    elif ' Start Job: ' in line:
        # Change Job and Program
        job = line.split(' Start Job: ')[1]
        if job.endswith('.ldt'):
            job = job[:-4]
        last_job = job.split('_')[0]
        last_program = job

    elif type_change_re.match(line) is not None:
        # Change Type
        last_type = line.split(' ')[-1]

# Finish (sorry for the duplicated code here, I couldn't figure out how to optimize it)
for field in hit_fields:
    dct[field].append(hit_fields[field])
for field in dct:
    if field not in hit_fields and field not in special_fields:
        dct[field].append(nan)
dct['Job'].append(last_job)
dct['Program'].append(last_program)
dct['Type'].append(last_type)

######################################

# Save it to a file
# (missing values are written as NaN, which strict JSON parsers may reject)
with open('data.json', 'w') as f:
    json.dump(dct, f)

# Or load it into a dataframe
df = pd.DataFrame(dct)
print(df)
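One possible way to get rid of the duplicated "finish" code is to move the row-appending logic into a small function and call it both inside the loop and once after it. This is only a sketch: `flush_item`, `COLUMNS` and `SPECIAL` are names I made up, and it uses `math.nan` instead of `numpy.nan` so it runs without extra packages (pandas treats both as missing). Unlike the code above, it silently ignores keys that are not columns of `dct` instead of raising a `KeyError`.

```python
import math

# Same columns as in dct above
COLUMNS = ['Job', 'Program', 'Type', 'Stoolname',
           'Times', 'Min/Max', 'Stool', 'Number']
# Columns filled from the running state, not from the item's own fields
SPECIAL = ('Job', 'Program', 'Type')


def flush_item(dct, hit_fields, job, program, type_):
    """Append one finished item to dct, padding missing fields with NaN."""
    for field in dct:
        if field not in SPECIAL:
            dct[field].append(hit_fields.get(field, math.nan))
    dct['Job'].append(job)
    dct['Program'].append(program)
    dct['Type'].append(type_)


# Hypothetical usage with made-up values
dct = {name: [] for name in COLUMNS}
flush_item(dct, {'Stoolname': 'A', 'Times': '3'}, 'J1', 'J1_P1', 'T1')
flush_item(dct, {'Stool': 'B'}, 'J1', 'J1_P1', 'T2')
print({name: len(values) for name, values in dct.items()})
```

Every call appends exactly one value to every column, so the lists stay the same length and `pd.DataFrame(dct)` still works.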
