Published on: November 9, 2023

Learning Python with a little help from AI

Use this guided tutorial, along with GitLab Duo Code Suggestions, to learn a new programming language.


Learning a new programming language can help broaden your software development expertise, open career opportunities, or create fun challenges. However, it can be difficult to decide on one specific approach to learning a new language. Artificial intelligence (AI) can help. In this tutorial, you'll learn how to leverage AI-powered GitLab Duo Code Suggestions for a guided experience in learning the Python programming language with a practical, hands-on example.

Preparations

Choose your preferred and supported IDE, and follow the documentation to enable Code Suggestions for GitLab.com SaaS or GitLab self-managed instances.

Programming languages can require installing command-line tools for the language interpreter, or compilers that generate binaries from source code, in order to build and run the application.

Tip: You can also use GitLab Remote Development workspaces to create your own cloud development environments, instead of local development environments. This blog post focuses on using VS Code and the GitLab Web IDE.

VS Code

Install VS Code on your client, and open it. Navigate to the Extensions menu and search for gitlab workflow. Install the GitLab Workflow extension for VS Code. VS Code will also detect the programming languages and offer to install additional extensions for syntax highlighting and a better development experience. For example, install the Python extension.

Code Suggestions

Familiarize yourself with how suggestions work before verifying them. GitLab Duo Code Suggestions are provided as you type, so you do not need to use specific keyboard shortcuts. To accept a code suggestion, press the tab key. Also note that writing new code works more reliably than refactoring existing code. AI is non-deterministic, which means that the same suggestion may not be repeated after deleting the code suggestion. While Code Suggestions is in Beta, we are working on improving the accuracy of generated content overall. Please review the known limitations, as this could affect your learning experience.

Tip: The latest release of Code Suggestions supports multiline instructions. You can refine the specifications to your needs to get better suggestions. We will practice this method throughout the blog post.

Learning a new programming language: Python

Now, let's dig into learning Python, which is one of the supported languages in Code Suggestions.

Before diving into the source code, make sure to set up your development environment.

Development environment for Python

  1. Create a new project learn-python-ai in GitLab, and clone the project into your development environment. All code snippets are available in this "Learn Python with AI" project.
git clone https://gitlab.com/NAMESPACE/learn-python-ai.git

cd learn-python-ai

git status
  2. Install Python and the build toolchain. Example on macOS using Homebrew:
brew install python
  3. Consider adding a .gitignore file for Python, for example this .gitignore template for Python.

You are all set to learn Python!

Hello, World

Start your learning journey in the official documentation, and review the linked resources, for example, the Python tutorial. The library and language reference documentation can be helpful, too.

Tip: When I first touched base with Python in 2005, I did not have many use cases beyond a framework to test Windows 2000 drivers. Later, in 2016, I refreshed my knowledge with the book "Head First Python, 2nd Edition," which provided great practical examples for the best learning experience – two weeks later, I could explain the differences between Python 2 and 3. You do not need to worry about Python 2 – it was deprecated several years ago, and we will focus only on Python 3 in this blog post. In August 2023, "Head First Python, 3rd Edition" was published. The book is a great learning resource to follow along with the exercises shared in this blog post.

Create a new file hello.py in the root directory of the project and start with a comment saying # Hello world. Review and accept the suggestion by pressing the tab key and save the file (keyboard shortcut: cmd s).

# Hello world
print("Hello World, hello GitLab Duo Code Suggestions")

Commit the change to the Git repository. In VS Code, use the keyboard shortcut ctrl shift G, add a commit message, and hit cmd enter to submit.

Use the command palette (cmd shift p) and search for create terminal to open a new terminal. Run the code with the Python interpreter. On macOS, the binary from Homebrew is called python3; other operating systems and distributions might use python without the version suffix.

python3 hello.py

Hello World, hello GitLab Duo Code Suggestions

Tip: Adding code comments in Python, starting with the # character, before you start writing a function or algorithm gives Code Suggestions more context to provide better suggestions. In the example above, we did that with # Hello world, and will continue doing so in the next exercises.

Add hello.py to Git, commit all changes and push them to your GitLab project.

git add hello.py

git commit -avm "Initialize Python"

git push

The source code for all exercises in this blog post is available in this "Learn Python with AI" project.

Start learning Python with a practical example

The learning goal in the following sections involves diving into the language datatypes, variables, flow control, and functions. We will also look into file operations, string parsing, and data structure operations for printing the results. The exercises will help build a command-line application that reads different log formats, works with the data, and provides a summary. This will be the foundation for future projects that fetch logs from REST APIs, and inspire more ideas such as rendering images, creating a web server, or adding Observability metrics.

Parsing log files into structured objects, example result after following the exercises

As an experienced admin, you can put the script into production and use real-world log format examples. Parsing and analyzing logs during stressful production incidents can be time-consuming, and a local CLI tool is sometimes faster than a log management tool.

Let's get started: Create a new file called log_reader.py in the directory root, add it to Git, and create a Git commit.

Define variables and print them

As a first step, we need to define the log file location and the expected file suffix. Therefore, let's create two variables and print them. Actually, ask Code Suggestions to do that for you by writing only the code comments and accepting the suggestions. Sometimes, you need to experiment with suggestions and delete already accepted code blocks. Do not worry – the quality of the suggestions will improve over time, as the model generates better suggestions with more context.

Define log path and file suffix variables

Print the variables to verify

# Specify the path and file suffix in variables
path = '/var/log/'
file_suffix = '.log'

# Print the variables 

print(path)
print(file_suffix)

Navigate into the VS Code terminal and run the Python script:

python3 log_reader.py

VS Code terminal, printing the variables

Python supports many different types in the standard library. The most common types are: Numeric (int, float, complex), Boolean (True, False), and String (str). Data structures include lists, tuples, and dictionaries.
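To get a feel for these types, here is a minimal, illustrative snippet (the values are made up for demonstration):

# Common built-in types (illustrative values)
log_file_count = 42                  # int
average_size_mb = 3.5                # float
is_compressed = False                # bool
hostname = 'ebpf-chaos'              # str
suffixes = ['.log', '.txt']          # list
dimensions = (80, 24)                # tuple
levels = {'info': 0, 'error': 1}     # dict

print(type(hostname), type(levels))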

Explore variable types

To practice different variable types, let's define a limit of log files to read as a variable with the integer type.

Log file variable

# Define log file limit variable 
log_file_limit = 1024 

Create a Boolean variable that forces reading all files in the directory, no matter the log file suffix.

# Define boolean variable whether to read all files recursively
read_all_files_recursively = True

File I/O: Read and print a log file

Create a directory called log-data in your project tree. You can copy all file examples from the log-data directory in the example project.

Create a new file sample.log with the following content, or any other two lines that provide a different message at the end.

Oct 17 00:00:04 ebpf-chaos systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Oct 17 00:00:04 ebpf-chaos systemd[1]: Finished Daily dpkg database backup service.

Instruct Code Suggestions to read the file log-data/sample.log and print the content.

Code Suggestions: Read log file and print it

# Read the file in log-data/sample.log and print its content
with open('log-data/sample.log', 'r') as f:
    print(f.read())

Tip: You will notice the indent here. The with open() as f: statement opens a new scope where f is available as a stream. This flow requires indenting (tab) the code block and performing actions in this scope: calling f.read() to read the file contents, and passing the returned value as a parameter into the print() function.
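To illustrate what the with statement handles for you, the same logic without it would need to close the file manually – a minimal sketch:

# Equivalent logic without the with statement: close the file manually
f = open('log-data/sample.log', 'r')
try:
    print(f.read())
finally:
    f.close()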

Navigate into the terminal, and run the script again with python3 log_reader.py. You will see the file content shown in the VS Code editor, also printed into the terminal.

VS Code terminal: Read log file, and print it

Flow control

Reading one log file is not enough – we want to analyze all files in a given directory recursively. For the next exercise, we instruct Code Suggestions to create an index of all files.

Prepare the log-data directory with more example files from the log-data directory in the example project. The directory tree should look as follows:

tree log-data
log-data
├── sample.log
└── var
    └── log
        ├── auth.log
        ├── syslog.log
        └── syslog_structured.log

3 directories, 4 files

Loops and lists to collect files

Modify the path variable to use the value log-data/.

# Specify the path and file suffix in variables
path = 'log-data/'
file_suffix = '.log'

Tell Code Suggestions to read all file paths in the directory into a list. After the collection loop, print the list of file paths.

# Read all file paths in the directory into a list

# Print the list of log file paths

Code Suggestion, collect file paths

An example of a suggestion could look like this:

# Read all file paths in the directory into a list
import os

# Read all file paths in the directory into a list
log_files = []
for root, directories, files in os.walk(path):
    for file in files:
        if file.endswith(file_suffix):
            log_files.append(os.path.join(root, file))

# Print the list of log file paths

print(log_files)

Let's analyze what happens here: The import os statement is required to bring the os library into the current scope, so we are able to call os.walk() later. The log_files list is initialized as an empty list using empty brackets. os.walk() expects a path as the directory tree root to start searching for files. The function returns a generator object, which walks the directory tree recursively. The returned tuples include the collected files, which we access in a loop iterator, using the for loop notation. When a file ends with the defined suffix, its full path is appended to the list. This is an optimized and complex example. Do not worry about understanding everything just yet – we will revisit this later.
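If you want to see what os.walk() actually yields, a small exploratory sketch that prints each tuple could look like this:

# Print each (root, directories, files) tuple yielded by os.walk()
import os

for root, directories, files in os.walk('log-data/'):
    print(root, directories, files)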

Conditionally collect files

Modify the instructions and add the Boolean flag evaluation for read_all_files_recursively to the file collection logic.

# Read all file paths in the top level directory into a list
# If read_all_files_recursively is True, read all files recursively

Code Suggestions, read files recursively conditionally

# Read all file paths in the top level directory 
import os

# Read all file paths in the top level directory into a list
# If read_all_files_recursively is True, read all files recursively

log_files = []
for file in os.listdir(path):
    if file.endswith(file_suffix):
        log_files.append(os.path.join(path, file))

if read_all_files_recursively:
    for root, directories, files in os.walk(path):
        for file in files:
            if file.endswith(file_suffix):               
                log_files.append(os.path.join(root, file))

# Print log_files
print(log_files)

The result is not optimal yet because it always executes the first loop, and optionally the second loop. This flow leads to duplicated results when the script is executed.

python3 log_reader.py

['log-data/sample.log', 'log-data/sample.log', 'log-data/var/log/auth.log']

Experiment with Code Suggestions instructions to get a solution for the problem. There are different approaches you can take:

  1. A potential solution is to wrap the source code into an if-then-else block, and move the os.listdir() loop into the else-block.
if read_all_files_recursively:
    for root, directories, files in os.walk(path):
        for file in files:
            if file.endswith(file_suffix):               
                log_files.append(os.path.join(root, file))
else:
    for file in os.listdir(path):
        if file.endswith(file_suffix):
            log_files.append(os.path.join(path, file))  

  2. Alternatively, do not use append() to always add a new list entry, but check if the item exists in the list first.
for file in os.listdir(path):
    if file.endswith(file_suffix):
        # check if the entry exists in the list already
        if os.path.isfile(os.path.join(path, file)):
            log_files.append(os.path.join(path, file))

if read_all_files_recursively:
    for root, directories, files in os.walk(path):
        for file in files:
            if file.endswith(file_suffix):
                # check if the entry exists in the list already
                if file not in log_files:
                    log_files.append(os.path.join(root, file))
  3. Or, we could eliminate duplicate entries after collecting all items. Python allows converting lists into sets, which hold unique entries. After applying set(), you can again convert the set back into a list. Code Suggestions knows about this possibility, and will help with the comment # Ensure that only unique file paths are in the list

Code Suggestions, converting a list to unique items

# Ensure that only unique file paths are in the list

log_files = list(set(log_files))
  4. Take a step back and evaluate whether the variable read_all_files_recursively makes sense. Maybe the default behavior should just be reading all files recursively?

Tip for testing different paths in VS Code: Select the code blocks, and press cmd / on macOS to comment out the code.

Functions

Let's create a function called parse_log_file that parses a log file, and returns the extracted data. We will define the expected log format and columns to extract, following the syslog format specification. There are different log format types and also customized formats by developers that need to be taken into account – exercise for later.

Start with a simple log format

Inspect a running Linux VM, or use the following example log file for additional implementation.

less /var/log/syslog | grep -v docker 

Oct 17 00:00:04 ebpf-chaos systemd[1]: Starting Daily dpkg database backup service...
Oct 17 00:00:04 ebpf-chaos systemd[1]: Starting Rotate log files...
Oct 17 00:00:04 ebpf-chaos systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Oct 17 00:00:04 ebpf-chaos systemd[1]: Finished Daily dpkg database backup service.
Oct 17 00:00:04 ebpf-chaos systemd[1]: logrotate.service: Deactivated successfully.
Oct 17 00:00:04 ebpf-chaos systemd[1]: Finished Rotate log files.
Oct 17 00:17:01 ebpf-chaos CRON[727495]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

We can create an algorithm to split each log line by whitespaces, and then join the results again. Let's ask Code Suggestions for help.

# Split log line "Oct 17 00:00:04 ebpf-chaos systemd[1]: Finished Rotate log files." by whitespaces and save in a list

log_line = "Oct 17 00:00:04 ebpf-chaos systemd[1]: Finished Rotate log files."
log_line_split = log_line.split(" ")
print(log_line_split)

Run the script again to verify the result.

python3 log_reader.py

['Oct', '17', '00:00:04', 'ebpf-chaos', 'systemd[1]:', 'Finished', 'Rotate', 'log', 'files.']

The first three items are part of the datetime string, followed by the host, service, and remaining log message items. Let's practice string operations in Python as the next step.

String and data structure operations

Let's ask Code Suggestions for help with learning to join strings, and perform list operations.

  1. Join the first three items with a whitespace again.
  2. Keep host and service.
  3. Join the remaining variable item count into a string, separated with whitespaces, again.
  4. Store the identified column keys, and their respective values in a new data structure: dictionary.

Code suggestions for list items with string operations

python3 log_reader.py

# Array
['Oct', '17', '00:00:04', 'ebpf-chaos', 'systemd[1]:', 'Finished', 'Rotate', 'log', 'files.']

# Dictionary 
{'datetime': 'Oct 17 00:00:04', 'host': 'ebpf-chaos', 'service': 'systemd[1]:', 'message': ' ebpf-chaos systemd[1]: Finished Rotate log files.'}

A working suggestion can look like the following:

# Initialize results dictionary with empty values for datetime, host, service, message
# Loop over log line split 
# Join the first three list items as date string
# Item 4: host 
# Item 5: service
# Join the remaining items into a string, separated with whitespaces 
# Print the results after the loop 

results = {'datetime': '', 'host': '', 'service': '', 'message': ''}

for item in log_line_split:

    if results['datetime'] == '':
        results['datetime'] = ' '.join(log_line_split[0:3])

    elif results['host'] == '':
        results['host'] = log_line_split[3]

    elif results['service'] == '':
        results['service'] = log_line_split[4]

    else:
        results['message'] += ' ' + item

print(results)

The suggested algorithm loops over all log line items, and applies the same operation for the first three items. log_line_split[0:3] extracts a slice of three items into a new list. Calling join() on a separator character and passing the list as an argument joins the items into a string. The algorithm continues to check for uninitialized values for host (item 4) and service (item 5), and concludes with the remaining list items appended to the message string. To be honest, I would have used a slightly different algorithm, but it is a great learning experience to see other algorithms, and ways to implement them. Practice with different instructions and data structures, and continue printing the data sets.
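As a minimal, standalone illustration of the slicing and joining used above (not part of the log reader):

# Slice the first three items and join them with a whitespace
items = ['Oct', '17', '00:00:04', 'ebpf-chaos']
first_three = items[0:3]        # ['Oct', '17', '00:00:04']
print(' '.join(first_three))    # Oct 17 00:00:04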

Tip: If you need to terminate a script early, you can use sys.exit(). The remaining code will not be executed.

import sys 
sys.exit(1)

Imagine doing these operations for different log formats, and message types – it can get complicated and error-prone very quickly. Maybe there is another approach.

Parse log files using regular expressions

There are different syslog format RFCs – RFC 3164 is obsolete but still found in the wild as default configuration (matching the pattern above), while RFC 5424 is more modern, including datetime with timezone information. Parsing this format can be tricky, so let's ask Code Suggestions for advice.

In some cases, the suggestions include regular expressions. They might not match immediately, making the code more complex to debug through trial and error. A good standalone resource to test and explain regular expressions is regex101.com.

Tip: You can skip diving deep into regular expressions using the following code snippet as a quick cheat. The next step involves instructing Code Suggestions to use these log patterns, and help us extract all valuable columns.

# Define the syslog log format regex in a dictionary
# Add entries for RFC3164, RFC5424
regex_log_pattern = {
    'rfc3164': r'([A-Z][a-z][a-z]\s{1,2}\d{1,2}\s\d{2}[:]\d{2}[:]\d{2})\s([\w][\w\d\.@-]*)\s(.*)$',
    'rfc5424': r'(?:(\d{4}[-]\d{2}[-]\d{2}[T]\d{2}[:]\d{2}[:]\d{2}(?:\.\d{1,6})?(?:[+-]\d{2}[:]\d{2}|Z)?)|-)\s(?:([\w][\w\d\.@-]*)|-)\s(.*)$'
}
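To verify that the patterns work before wiring them into a function, a quick, standalone check against one sample line could look like this:

# Quick check: match one sample log line against the RFC 3164 pattern
import re

sample_line = 'Oct 17 00:00:04 ebpf-chaos systemd[1]: Finished Rotate log files.'
match = re.match(regex_log_pattern['rfc3164'], sample_line)
if match:
    print(match.groups())   # ('Oct 17 00:00:04', 'ebpf-chaos', 'systemd[1]: Finished Rotate log files.')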

We know what the function should do, and its input parameters – the file name, and a log pattern to match. The log lines should be split by this regular expression, returning a key-value dictionary for each log line. The function should return a list of dictionaries.

# Create a function that parses a log file
# Input parameter: file path
# Match log line against regex_log_pattern
# Return the results as dictionary list: log line, pattern, extracted columns

Code suggestion based on a multiline comment instruction to get a function that parses a log file based on regex patterns

Remember the indent for opening a new scope? The same applies to functions in Python. The def keyword requires a function name and a list of parameters, followed by a colon. The next lines of code require the indent. VS Code will help by live-linting wrong indentation before the script execution fails, or the CI/CD pipeline does.
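As a minimal illustration of the syntax – a hypothetical greet() function that is not part of the log reader:

# def opens a new scope: the function body must be indented
def greet(name):
    message = 'Hello, ' + name + '!'
    return message

print(greet('GitLab'))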

Continue with Code Suggestions – it might already know that you want to parse all log files, and parse them using the newly created function.

Code suggestion to parse all log files, and print the result set

A full working example can look like this:

import os

# Specify the path and file suffix in variables
path = 'log-data/'
file_suffix = '.log'

# Read all file paths in the directory into a list
log_files = []
for root, directories, files in os.walk(path):
    for file in files:
        if file.endswith(file_suffix):
            log_files.append(os.path.join(root, file))

# Define the syslog log format regex in a dictionary
# Add entries for RFC3164, RFC5424
regex_log_pattern = {
    'rfc3164': r'([A-Z][a-z][a-z]\s{1,2}\d{1,2}\s\d{2}[:]\d{2}[:]\d{2})\s([\w][\w\d\.@-]*)\s(.*)$',
    'rfc5424': r'(?:(\d{4}[-]\d{2}[-]\d{2}[T]\d{2}[:]\d{2}[:]\d{2}(?:\.\d{1,6})?(?:[+-]\d{2}[:]\d{2}|Z)?)|-)\s(?:([\w][\w\d\.@-]*)|-)\s(.*)$'
}

# Create a function that parses a log file
# Input parameter: file path
# Match log line against regex_log_pattern
# Return the results as dictionary list: log line, pattern name, extracted columns
import re

def parse_log_file(file_path):
    # Read the log file
    with open(file_path, 'r') as f:
        log_lines = f.readlines()

    # Create a list to store the results
    results = []

    # Iterate over the log lines
    for log_line in log_lines:
        # Match the log line against the regex pattern
        for pattern_name, pattern in regex_log_pattern.items():
            match = re.match(pattern, log_line)

            # If the log line matches the pattern, add the results to the list
            if match:
                extracted_columns = match.groups()
                results.append({
                    'log_line': log_line,
                    'pattern_name': pattern_name,
                    'extracted_columns': extracted_columns,
                    'source_file': file_path
                })

    # Return the results
    return results

# Parse all files and print results
for log_file in log_files:
    results = parse_log_file(log_file)
    print(results)

Let's unpack what the parse_log_file() function does:

  1. Opens the file from file_path parameter.
  2. Reads all lines into a new variable log_lines.
  3. Creates a results list to store all items.
  4. Iterates over the log lines.
  5. Matches against all regex patterns configured in regex_log_pattern.
  6. If a match is found, extracts the matching column values.
  7. Creates a results item, including the values for the keys log_line, pattern_name, extracted_columns, source_file.
  8. Appends the results item to the results list.
  9. Returns the results list.

There are different variations to this – especially for the returned result data structure. For this specific case, the log lines already come as a list. Adding a dictionary object instead of a raw log line allows function callers to extract the desired information in the next step. Once a working example has been implemented, you can refactor the code later, too.
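Because each result item is a dictionary, callers can pick exactly the keys they need. A small usage sketch, assuming the parse_log_file() function from above:

# Access individual keys of the returned dictionaries
for entry in parse_log_file('log-data/sample.log'):
    print(entry['pattern_name'], entry['extracted_columns'])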

Advanced log format: auth.log

Parsing the syslog on a Linux distribution might not unveil the necessary data to analyze. On a virtual machine that exposes port 22 (SSH) to the world, the authentication log is much more interesting – plenty of bots and malicious actors test default password combinations, and often attempt brute-force attacks.

The following snippet from /var/log/auth.log on one of my private servers shows the authentication log format and the random attempts from bots using different usernames, etc.

Oct 15 00:00:19 ebpf-chaos sshd[3967944]: Failed password for invalid user ubuntu from 93.254.246.194 port 48840 ssh2
Oct 15 00:00:20 ebpf-chaos sshd[3967916]: Failed password for root from 180.101.88.227 port 44397 ssh2
Oct 15 00:00:21 ebpf-chaos sshd[3967944]: Received disconnect from 93.254.246.194 port 48840:11: Bye Bye [preauth]
Oct 15 00:00:21 ebpf-chaos sshd[3967944]: Disconnected from invalid user ubuntu 93.254.246.194 port 48840 [preauth]
Oct 15 00:00:24 ebpf-chaos sshd[3967916]: Failed password for root from 180.101.88.227 port 44397 ssh2
Oct 15 00:00:25 ebpf-chaos sshd[3967916]: Received disconnect from 180.101.88.227 port 44397:11:  [preauth]
Oct 15 00:00:25 ebpf-chaos sshd[3967916]: Disconnected from authenticating user root 180.101.88.227 port 44397 [preauth]
Oct 15 00:00:25 ebpf-chaos sshd[3967916]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=180.101.88.227  user=root
Oct 15 00:00:25 ebpf-chaos sshd[3967998]: Invalid user teamspeak from 185.218.20.10 port 33436

Tip for intrusion prevention: Add a firewall setup, and use fail2ban to block invalid auth logins.

The next exercise is to extend the logic to understand the free-form log message parts, for example Failed password for invalid user ubuntu from 93.254.246.194 port 48840 ssh2. The task is to store the data in an optional dictionary with key-value pairs.

Create a new function that takes the previously parsed log line results as input, and specifically parses the last list item for each line.

  1. Count the number of Failed password and Invalid user messages.
  2. Return the results with count, log file, pattern.

Code suggestions for a log file message parser to count auth.log failures

A working suggestion can look like the following code:

# Create a function that parses a log file message from the last extracted_columns entry 
# Input: Parsed log lines results list 
# Loop over all log lines in the list, and extract the last list item as message 
# Count failure strings in the message: Failed password, Invalid user 
# Return the results if failure count greater 0: log_file, count, failure string
def parse_log_file_message(results):
    failure_results = []

    # Iterate over the log lines
    for result in results:
        # Extract the message from the last list item
        message = result['extracted_columns'][-1]

        # Count the number of failure strings in the message
        failure_count = message.count('Failed password') + message.count('Invalid user')

        # If the failure count is greater than 0, add the results to the list
        if failure_count > 0:
            failure_results.append({
                'log_file': result['source_file'],
                'count': failure_count,
                'failure_string': message
            })

    # Return the results
    return failure_results

# Parse all files and print results
for log_file in log_files:
    results = parse_log_file(log_file)
    failure_results = parse_log_file_message(results)
    print(failure_results)

The algorithm follows the previous implementations: First, create a results list to store matching data. Then, iterate over the already parsed log lines in the list. Each log line contains the extracted_columns key, which holds the free-form message string at the end. The next step is to call the string object function count() to count how many times a given character sequence is contained in a string. The returned numbers are added up in the failure_count variable. If it is greater than zero, the result is added to the results list, including the log_file, count, and failure_string key-value pairs. After returning the parsed log message results, loop through all log files, parse them, and print the results again.
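As a tiny standalone example of how str.count() behaves:

# count() returns how often a substring occurs in a string
message = 'Failed password for invalid user ubuntu from 93.254.246.194 port 48840 ssh2'
print(message.count('Failed password'))  # 1
print(message.count('Invalid user'))     # 0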

Execute the script to inspect the detected matches. Note that the data structure can be optimized in future learning steps.

python3 log_reader.py

[{'log_file': 'log-data/var/log/auth.log', 'count': 1, 'failure_string': 'sshd[3967944]: Failed password for invalid user ubuntu from 93.254.246.194 port 48840 ssh2'}, {'log_file': 'log-data/var/log/auth.log', 'count': 1, 'failure_string': 'sshd[3967916]: Failed password for root from 180.101.88.227 port 44397 ssh2'}, {'log_file': 'log-data/var/log/auth.log', 'count': 1, 'failure_string': 'sshd[3967916]: Failed password for root from 180.101.88.227 port 44397 ssh2'}, {'log_file': 'log-data/var/log/auth.log', 'count': 1, 'failure_string': 'sshd[3967998]: Invalid user teamspeak from 185.218.20.10 port 33436'}, {'log_file': 'log-data/var/log/auth.log', 'count': 1, 'failure_string': 'sshd[3967998]: Failed password for invalid user teamspeak from 185.218.20.10 port 33436 ssh2'}, {'log_file': 'log-data/var/log/auth.log', 'count': 1, 'failure_string': 'sshd[3968077]: Invalid user mcserver from 218.211.33.146 port 50950'}]

Parsing more types: Structured logging

Application developers can use the structured logging format to help machine parsers extract the key-value pairs. Prometheus, for example, provides this information in the following structure in syslog:

Oct 17 19:00:10 ebpf-chaos prometheus[594]: ts=2023-10-17T19:00:10.425Z caller=compact.go:519 level=info component=tsdb msg="write block" mint=1697558404661 maxt=1697565600000 ulid=01HCZG4ZX51GTH8H7PVBYDF4N6 duration=148.675854ms
Oct 17 19:00:10 ebpf-chaos prometheus[594]: ts=2023-10-17T19:00:10.464Z caller=head.go:1213 level=info component=tsdb msg="Head GC completed" caller=truncateMemory duration=6.845245ms
Oct 17 19:00:10 ebpf-chaos prometheus[594]: ts=2023-10-17T19:00:10.467Z caller=checkpoint.go:100 level=info component=tsdb msg="Creating checkpoint" from_segment=2308 to_segment=2309 mint=1697565600000
Oct 17 19:00:10 ebpf-chaos prometheus[594]: ts=2023-10-17T19:00:10.517Z caller=head.go:1185 level=info component=tsdb msg="WAL checkpoint complete" first=2308 last=2309 duration=50.052621ms

This format is easier for scripts to parse, because the message part can be split by whitespaces and the assignment character =. Strings that contain whitespaces are guaranteed to be enclosed in quotes. The downside is that not all programming languages provide ready-to-use structured logging libraries, making it harder for developers to adopt this format.

Practice by following the previous auth.log example and parse this format with its additional information. Tell Code Suggestions that you expect a structured logging format with key-value pairs, and which data structure should be returned:

# Create a function that parses a log file message from the last extracted_columns entry 
# Input: Parsed log lines results list 
# Loop over all log lines in the list, and extract the last list item as message 
# Parse structured logging key-value pairs into a dictionary
# Return results: log_file, dictionary 

Code suggestions for parsing structured logging format in the log file message part
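One possible shape of such a parser – a minimal sketch that assumes the message can be tokenized with shlex.split() to respect quoted values; your Code Suggestions result will likely differ:

# Sketch: parse structured logging key-value pairs from a message string
import shlex

def parse_structured_message(message):
    pairs = {}
    # shlex.split() keeps quoted values like msg="write block" together
    for token in shlex.split(message):
        if '=' in token:
            key, _, value = token.partition('=')
            pairs[key] = value
    return pairs

print(parse_structured_message('ts=2023-10-17T19:00:10.425Z level=info msg="write block"'))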

Printing results and formatting

Many of the examples used the print() statement to print content on the terminal. Python objects in the standard library support a text representation; for some types it is meaningful (strings, numbers), while others cannot provide much detail (functions, etc.).

You can also pretty-print almost any data structure (lists, sets, dictionaries) in Python. The JSON library can format data structures in a readable way, using a given number of spaces as indent to draw the JSON structure on the terminal. Note that we use the import statement here to bring the library into the current scope and access its methods, for example json.dumps.

import json 
print(json.dumps(structured_results, indent=4))

Parsing log files into structured objects, example result after following the exercises

Practice modifying the existing source code, and replace the code snippets where appropriate. Alternatively, create a new function that implements pretty printing.

# Create a pretty print function with indent 4 

Code suggestions for pretty-print function
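A minimal sketch of such a helper, assuming the data is JSON-serializable:

# Pretty-print any JSON-serializable data structure with an indent of 4
import json

def pretty_print(data):
    print(json.dumps(data, indent=4))

pretty_print({'log_file': 'log-data/sample.log', 'count': 1})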

This idea works in a similar fashion to creating your own logger functions... but we have to stop learning and take a break. Before we conclude the first blog post in the learning series, let's ensure that CI/CD and dependencies are set up properly for future exercises and async practice.

Dependency management and continuous verification

Pip and pyenv: Bringing structure into Python

Dependencies can be managed in the requirements.txt file, including optional version constraints. Using a requirements.txt file also has the advantage of being the single source of truth for local development environments and for continuous builds with GitLab CI/CD. Both can use the same installation command:

pip install -r requirements.txt
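For this project, the requirements.txt file could simply list the linters used later in the CI/CD pipeline – an assumed example; adjust the contents and pin versions as needed:

# requirements.txt (example contents)
pyflakes
ruff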

Some Linux distributions do not install the pip package manager by default; for example, Ubuntu/Debian require installing the python3-pip package.

You can manage different virtual environments using venv. This workflow is beneficial for installing Python dependencies into the virtual environment, instead of globally into the OS path, which might break on upgrades.

pip install virtualenv
virtualenv venv
source venv/bin/activate 

Automation: Configure CI/CD pipeline for Python

The CI/CD pipeline should continuously lint, test, and build the code. You can mimic the steps from local development, and add tests for more environments and versions:

  1. Lint the source code and check for formatting errors. The example uses Pyflakes, a mature linter, and Ruff, a fast linter written in Rust.
  2. Cache dependencies installed using the pip package manager, following the documentation for Python caching in GitLab CI/CD. This saves time and resources on repeated CI/CD pipeline runs.
  3. Use parallel matrix builds to test different Python versions, based on the available container images on Docker Hub and their tags.
stages:
  - lint
  - test

default:
  image: python:latest
  cache:                      # Pip's cache doesn't store the python packages
    paths:                    # https://pip.pypa.io/en/stable/topics/caching/
      - .cache/pip
  before_script:
    - python -V               # Print out python version for debugging
    - pip install virtualenv
    - virtualenv venv
    - source venv/bin/activate

variables:  # Change pip's cache directory to be inside the project directory since we can only cache local items.
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

# lint template
.lint-tmpl:
  script:
    - echo "Linting Python version $VERSION"
  parallel:
    matrix:
      - VERSION: ['3.9', '3.10', '3.11', '3.12']   # https://hub.docker.com/_/python

# Lint, using Pyflakes: https://pypi.org/project/pyflakes/ 
lint-pyflakes:
  extends: [.lint-tmpl]
  script:
    - pip install -r requirements.txt
    - find . -not -path './venv/*' -type f -name '*.py' -exec sh -c 'pyflakes {}' \;

# Lint, using Ruff (Rust): https://docs.astral.sh/ruff/ 
lint-ruff:
  extends: [.lint-tmpl]
  script:
    - pip install -r requirements.txt
    - ruff .

GitLab CI/CD Python lint job view, part of matrix builds

What is next

Fun fact: GitLab Duo Code Suggestions also helped write this blog post in VS Code, using the surrounding context. In the screenshot, I just wanted to add a tip about regex101, and GitLab Duo already knew.

Writing the GitLab blog post in VS Code with support from GitLab Duo Code Suggestions

In an upcoming blog, we will look into advanced learning examples with more practical (log) filtering and parallel operations, how to fetch logs from API endpoints (CI/CD job logs for example), and more data analytics and observability. Until then, here are a few recommendations for practicing async.

Async learning exercises

  • Implement the missing log_file_limit variable check.
  • Print a summary of the results in Markdown, not only JSON format.
  • Extend the script to accept a search filter as an environment variable. Print/count only filtered results.
  • Extend the script to accept a date range. This might require parsing the datetime column into a time object to compare the range.
  • Inspect a GitLab CI/CD pipeline job log, and download the raw format. Extend the log parser to parse this specific format, and print a summary.

Share your feedback

Which programming language are you learning or considering learning? Start a new topic on our community forum or Discord and share your experience.

When you use GitLab Duo Code Suggestions, please share your thoughts and feedback in the feedback issue.
