Autogenerating blog posts from your project's commit history with python

Published Jul 30, 2017. Last updated Aug 01, 2017.
Source:

source code

In this post, we will look at how blogplish (blog autopublish) was written. The script this post describes wrote this post (from its own commit messages). It sounds confusing...let's just write it again.

Viewing each commit:

You can also follow along by viewing the entire file for each commit.

Clone the repo.

For each commit, do git checkout {commit_number} (git only needs the first 6 characters of a commit id):

$ git clone https://github.com/codyc4321/blogplish.git
$ cd blogplish
$ git checkout b37ae0
Previous HEAD position was f000c4e... Now step 2, adding the diff of each file that was changed:
HEAD is now at b37ae03... Let's start the project by making a python script called , which has one line that prints a message;

You can see the code at that stage, and the description in each commit message.

You can leave the detached HEAD state and return to the final commit with:

git checkout master

The commit numbers are given next to each version of the file in this post, like

blogplish.py commit_id = abc123

Blogplish

Make a new git repo. Make a file called blogplish.py, using touch blogplish.py.

$ mkdir blogplish
$ cd blogplish
$ git init
$ touch blogplish.py

Add only the code print("The script is working.") in this script.

You can run this file using the python command in your terminal:

$ python blogplish.py
The script is working.

blogplish.py commit_id = b37ae0


print("The script is working.")

The goal here is that Blogplish will write most of this post you're reading by parsing its own commits, and its own file contents after each commit, and generating the necessary markdown to be cut and pasted in here.

In the future, it should be able to autogenerate a blog post from any project, when given the path to that project's git folder, and it should be interactive. For now, blogplish will write a narrative of its own creation.

When you're working on a task you've never done before, most of your time is usually spent figuring out what it is you need to do. In this case, we found ourselves stuck planning out how to write the script: Do we use regular Bash, or the GitHub API? Do we get all commits at once using git log, or go backwards getting 1 commit at a time using the HEAD~1 style syntax until there are no commits left? And so on.

When you get that stuck, it's best to start out writing pseudocode and describe what you think you need to do overall:

blogplish.py commit_id = 536ec9


print("The script is working.")

"""
Write function to call Bash command from Python

Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""

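Notice that a triple-quoted string is a multiline string literal. Python has no multiline comment syntax, but a string that isn't assigned to anything is evaluated and discarded, so it is often used like a multiline comment.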
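As a small illustration of how a bare triple-quoted string behaves, here is a sketch with two made-up functions (documented and commented are hypothetical names, not part of blogplish). A string is only kept as a docstring when it is the first statement in the function:

```python
def documented():
    """This string is the function's docstring."""
    return 1


def commented():
    x = 1
    """
    This bare string is evaluated and thrown away,
    so it behaves like a multiline comment.
    """
    return x


print(documented.__doc__)  # the docstring is kept
print(commented.__doc__)   # nothing was stored for the later string
```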

As far as the first step "Write function to call Bash command from Python" goes, I already had a sturdy function to flexibly run linux commands in a python script:

blogplish.py commit_id = 8efea85

from subprocess import Popen, PIPE


"""
Write function to call Bash command from Python

Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""


def call_sp(command, *args, **kwargs):
    """ you can run command from any directory you want by passing in a kwarg of 'cwd' (current working directory):

        call_sp('ls -a', cwd='/home/username/projects/awesomeproject')
    """
    if args:
        command = command.format(*args)
    p = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
    output, err = p.communicate()
    return output, err


output, error = call_sp('ls')
print(output)

I don't remember how call_sp was built besides a lot of begging people to do my work for me on Stackoverflow, but you can read about running subprocesses (running terminal commands in python) here.

Run 'ls' in your terminal, and see it outputs only blogplish.py, the only file in our project (besides hidden .git files). In the python script, it also runs ls, in the call_sp('ls') portion. The output here should match:

cchilders:~/blogplish (master)    $ python blogplish.py    
blogplish.py        
cchilders:~/blogplish (master)    $ ls    
blogplish.py
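One caveat if you follow along on Python 3: there, communicate() returns bytes rather than text, so the output prints as b'blogplish.py\n'. A minimal sketch of a text-returning variant, assuming Python 3.5 or newer; the name run_command is hypothetical and is not part of blogplish:

```python
import subprocess


def run_command(command, cwd=None):
    """Hypothetical Python 3 variant of call_sp that returns text, not bytes."""
    completed = subprocess.run(
        command,
        shell=True,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stdout and stderr to str
    )
    return completed.stdout, completed.stderr


output, error = run_command('echo hello')
print(output)
```

The rest of this post keeps using call_sp as the author wrote it.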

Update our pseudocode:

blogplish.py commit_id = bb19fc

...

"""
Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""
...

As for Get all commit info, we can use git log for that. Update call_sp:

blogplish.py commit_id = ea270e

...

output, error = call_sp('git log')
print(output)

Now run our script again: python blogplish.py.

You can compare the output with git log in your terminal. You should see a summary of your commit history like:

cchilders:~/blogplish (master)    
$ python blogplish.py    
commit ea270e9a879b385580a855f1f83736ccce345de3    
Author: Cody Childers <email@example.com>    
Date:   Sun Jul 30 00:06:03 2017 -0500            
     
     As for `Get all commit info`, we can use `git log` for that. 
     Update `call_sp`:
     
commit bb19fca5f6461fbf8ca6e1870964021f818ba063    
Author: Cody Childers <email@example.com>    
Date:   Sun Jul 30 00:00:11 2017 -0500            

    Update our pseudocode:        
    
...etc...

As before, the output of our script and the output of running git log in terminal should be identical.

Next, we need to parse the output of git log. Look at what it outputs and take a few minutes to think about how you'd parse it to get the commit ID and message for each commit. We can ignore Author and Date. Since git log outputs commits in order, our script will just start from the first commit.

Start a function to do the parsing:

blogplish.py commit_id = 182ec6

from subprocess import Popen, PIPE


"""
Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""

def call_sp(command, *args, **kwargs):
    """ you can run command from any directory you want by passing in a kwarg of 'cwd' (current working directory):

        call_sp('ls -a', cwd='/home/username/projects/awesomeproject')
    """
    if args:
        command = command.format(*args)
    p = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
    output, err = p.communicate()
    return output, err


def parse_git_log_info(text_output):
    pass


output, error = call_sp('git log')
print(output)

At first we tried this:

blogplish.py commit_id = d5bdbdf

import re
from subprocess import Popen, PIPE

...

def parse_git_log_info(text_output):
    # https://stackoverflow.com/questions/10974932/split-string-based-on-a-regular-expression
    commits_array = re.split("commit \w{40}", text_output)
    print(commits_array)


output, error = call_sp('git log')
print(output)

parse_git_log_info(output)

But the problem was, it was cutting off the commit id:

[
  '', "\nAuthor: Cody Childers <email@example.com>\nDate:   Sun Jul 30 00:15:39 2017 -0500\n\n    Next, blah blah...\n", 
  '\nAuthor: ...
]
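To see why the id disappears, here is a small sketch on made-up log text (the 40-character ids are fake). re.split throws away whatever the pattern matched, unless the pattern contains a capturing group, in which case the group text is returned between the pieces:

```python
import re

# Fake git log output with two commits; the ids are placeholders.
sample = (
    "commit " + "a" * 40 + "\nAuthor: A\n\n    first\n\n"
    "commit " + "b" * 40 + "\nAuthor: B\n\n    second\n"
)

# The matched "commit <id>" text is discarded by split.
without_ids = re.split(r"commit \w{40}", sample)

# With a capturing group, the ids are kept as separate list items.
with_ids = re.split(r"commit (\w{40})", sample)

print(without_ids)
print(with_ids)
```

This is only one way around the problem; the post goes a different route with re.findall.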

We were able to split the git log output by using re.findall. The re package is a python pattern matcher, that allows you to find text of interest. The easiest way to write regexes is to go to pythex.org.

[Screenshot: using pythex.org]

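We used the time module to pause for a few seconds after printing the number of matches, so we could check that count (we expected approximately 10-15 commits) before the commits themselves scrolled past: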

blogplish.py commit_id = 60e1fa

import re
import time
from subprocess import Popen, PIPE

...

def parse_git_log_info(text_output):
    # https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python
    # https://stackoverflow.com/questions/1870954/python-regular-expression-across-multiple-lines
    rgx = re.compile(r"commit \w{40}.*?(?=commit)", re.DOTALL)
    commits_array = re.findall(rgx, text_output)
    print(len(commits_array))
    time.sleep(3)
    for item in commits_array:
        print(item)
        print('\n\n\n\n')


output, error = call_sp('git log')

parse_git_log_info(output)

The r"commit \w{40}.*?(?=commit)" is a pattern matcher, called a regular expression.

First we tried the regex r"commit \w{40}.*(?=commit)" which was broken, because .* is greedy: it matches as much as possible, skipping past every commit word in between and stopping only just before the last commit word in the text. The .*? will allow the regex to stop short and only match from commit to the next commit word.
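The difference is easier to see on a tiny input. This sketch uses shortened 4-digit ids instead of real 40-character ones, purely for readability:

```python
import re

text = "commit 1111 first commit 2222 second commit 3333 third"

# Greedy: .* runs as far as it can, so one match swallows two commits.
greedy = re.findall(r"commit \d{4}.*(?=commit)", text)

# Lazy: .*? stops at the first place where "commit" comes next.
lazy = re.findall(r"commit \d{4}.*?(?=commit)", text)

print(greedy)  # ['commit 1111 first commit 2222 second ']
print(lazy)    # ['commit 1111 first ', 'commit 2222 second ']
```

Notice that neither version returns the last commit, because no further commit word follows it for the lookahead to find.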

You can now run python blogplish.py to see how this parser works.

Regexes

A general overview would be:

^ = match at the beginning of the string (with the re.MULTILINE flag, also at positions just after a newline)
\w = match any 'word' character: letters a-z and A-Z, digits 0-9, and underscore _
\d = match any digit 0-9
\s = match any whitespace character, like space, newline (\n), tab (\t), and even \r, \f, and \v
{2,4} = match from 2 to 4 repetitions of the thing preceding me (no space after the comma)
{40} = match exactly 40 repetitions of the thing preceding me
[afg] = match either a, f, or g
(?=commit) = match only when the word 'commit' comes next
. = match any single character except a newline (unless the re.DOTALL flag is active)
* = match as many repetitions of the thing preceding me as possible, from 0 to infinity
? = match 0 or 1 (makes an item optional)
.* = match everything from here to the end of the line (until we hit the \n), unless the re.DOTALL flag is active, in that case match till the end of the text string
.*? = match everything, but try to stop the match as soon as possible

Notice the re.DOTALL flag above. This allows our .*? to also match past each new line (\n), whereas without the re.DOTALL any .* type regex won't match over multiple lines.

Thus our regex r"commit \w{40}.*?(?=commit)" means "match the word commit, followed by a space, followed by 40 letters or numbers, and match everything else after that until we hit another word 'commit', then stop, and don't capture the second commit word we hit".
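The effect of re.DOTALL described above can be checked with a two-line string. A quick sketch:

```python
import re

text = "first line\nsecond line"

# Without DOTALL, . refuses to match the newline, so .* stops at the line end.
single_line = re.match(r".*", text).group()

# With DOTALL, . also matches the newline, so .* runs to the end of the text.
whole_text = re.match(r".*", text, re.DOTALL).group()

print(repr(single_line))  # 'first line'
print(repr(whole_text))   # 'first line\nsecond line'
```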

A regex has a difference between "capturing" and "matching". In this post, capturing means the portion will also be returned by the function, whereas non-capturing matches won't be included in the output. (The re documentation calls this "consuming" text, and uses "capturing" for parenthesized groups.) By default, a match is also captured. But there are many matches you can use that don't capture, as we used a positive lookahead assertion to match the second commit, but not capture it in our output:

# positive lookahead assertion
(?=dog) == only match where dog comes next, but don't capture 'dog'

# negative lookahead assertion
(?!dog) == only match where dog doesn't come next. Don't capture 'dog'

# positive lookbehind assertion
(?<=dog) == only match where dog came previously. Don't capture 'dog'

# negative lookbehind assertion
(?<!dog) == only match where dog didn't come previously. Don't capture 'dog'

You can find these ones here, by searching for "positive lookbehind assertion".

Some examples would be

Regex:
  r"dog(?=fish)"
Test string:
  "dogfish"
Result:
  Finds a match because "fish" came after "dog", and returns "dog"

Regex:
  r"book(?!shelf)"
Test string:
  "I like books. Books are good, but much less intellectual than Twitter. I replaced all the books on my bookshelf with printouts of my favorite tweets."
Result:
  If run through re.findall, matches the "book" portion of "I like books" because "shelf" doesn't come after "book". Does not match "Books" because it has a capital "B", whereas our regex doesn't. Matches "book" of "all the books on". Does not match "bookshelf" because shelf comes after "book".
  
Regex:
  r"(?<=dog)Fish"
Test string:
  "Fish is tasty"
Result:
  Doesn't match, because "dog" didn't come before "Fish". Returns None
  
Regex:
  r"(?<!dog)fish"
Test string:
  "There are 117 preeminent scientists are concerned about how the rapid advance in science will be used after the successful creation of the first 'dogfish'. Not cute like a dog, and not tasty like a fish, the dogfish is a monstrosity that will be used as a living mascot of Quididdle, a startup that connects startups with other startups to help them startup."
Result:
  Doesn't match any instance of "dogfish" because "dog" comes before "fish". Matches "fish" from "like a fish" because "dog" didn't come before "fish".
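The four examples above can be checked directly in Python. This sketch uses shortened versions of the test strings:

```python
import re

# Positive lookahead: "fish" must follow, but only "dog" is returned.
lookahead = re.search(r"dog(?=fish)", "dogfish").group()

# Negative lookahead: skip "book" when "shelf" follows.
books = re.findall(
    r"book(?!shelf)",
    "I like books. Books are good. All the books on my bookshelf.",
)

# Positive lookbehind: "dog" must come right before "Fish".
lookbehind = re.search(r"(?<=dog)Fish", "Fish is tasty")

# Negative lookbehind: skip "fish" when "dog" comes right before it.
fish = re.findall(
    r"(?<!dog)fish",
    "Not tasty like a fish, the dogfish is a monstrosity",
)

print(lookahead)   # dog
print(books)       # ['book', 'book']
print(lookbehind)  # None
print(fish)        # ['fish']
```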

For some old-fashioned regex nerdfun, fire up http://pythex.org/ and input the following test string in "Your test string":

There are 117 preeminent scientists are concerned about how the rapid advance in science will be used after the successful creation of the first 'dogfish'. Not cute like a dog, and not tasty like a fish, the dogfish is a monstrosity that will be used as a living mascot of Quididdle, a startup that connects startups with other startups to help them startup.

Now input the following regexes (from each line not beginning in #) one by one, and read the matches:

# match 6 word characters in a row
\w{6}

# match 'There', a space, and 1 word character
There\s\w

# match 'There', a space, and any number of consecutive word characters
There\s\w*

# match 'are' and a space
are\s

# match 'are', a space, and exactly 2 number characters
are\s\d{2}

# match any lowercase letter or space, but only if there's 50 in a row
[a-z\s]{50}
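If you'd rather stay in a Python shell than use pythex, the same patterns can be tried with the re module. A small sketch, using only the first few words of the test string:

```python
import re

text = "There are 117 preeminent scientists are concerned"

six_word_chars = re.search(r"\w{6}", text).group()
there_one = re.search(r"There\s\w", text).group()
there_many = re.search(r"There\s\w*", text).group()
are_space = re.findall(r"are\s", text)
are_digits = re.findall(r"are\s\d{2}", text)

print(six_word_chars)  # preemi
print(there_one)       # There a
print(there_many)      # There are
print(are_space)       # ['are ', 'are ']
print(are_digits)      # ['are 11']
```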

Now we will return to our code.

Back to blogplish

This seemed like a great start, but we soon noticed that the commit messages where we had copypasted the output of git log as an example to use in the tutorial broke our parse_git_log_info function, because they also matched commit \w{40}:

...        
commit 3e4aca9f102229c890ef73967f4a4c1c61a51a73    
Author: Cody Childers <email@example.com>    
Date:   Sun Jul 30 00:08:46 2017 -0500            

    Now run our script again. 
    You can compare the output with `git log` in your terminal. 
    You should see a summary of your commit history like:            
   
    cchilders:~/blogplish (master)        
    $ python blogplish.py        
    commit ea270e9a879b385580a855f1f83736ccce345de3        
    Author: Cody Childers <email@example.com>        
    Date:   Sun Jul 30 00:06:03 2017 -0500                
    
        As for `Get all commit info`, we can use `git log` for that. 
        Update `call_sp`:            
            commit bb19fca5f6461fbf8ca6e1870964021f818ba063        
            Author: Cody Childers <email@example.com>        
            Date:   Sun Jul 30 00:00:11 2017 -0500        
            
                ...commit message here...

This makes sense, as we were writing the first draft of this post in our commit messages as we wrote blogplish, since that's the entire purpose of blogplish. But it means we can't parse our commits that way.

This threw a wrench in our plan of 1 distinct function to split the commits into an array, and another function to parse each singular commit one by one. Instead, we decided on a rambling parser that will parse the entire output line by line. Hideous, but works:

blogplish.py commit_id = 4b6aee7

...

def parse_git_log_info(text_output):
    commit_count = 0
    commit_start_rgx = r"^commit \w{40}"
    lines = text_output.split('\n')
    # commits_array = []
    current_commit_string = ""
    for line in lines:
        match = re.match(commit_start_rgx, line)
        if match:
            commit_count += 1
            print(line + " matched the start of a commit")
    print("\n")
    print(commit_count)
    # return commits_array


output, error = call_sp('git log')

parse_git_log_info(output)
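The reason the line-by-line check is safe is that git indents every line of a commit message, so a pasted "commit ..." line inside a message never starts at column 0. A short sketch with fake ids:

```python
import re

commit_start_rgx = r"^commit \w{40}"

real_header = "commit " + "e" * 40          # how git prints a commit header
pasted_line = "    commit " + "b" * 40      # the same text inside a message

# re.match only tries to match at the start of the string it is given,
# so the indented line does not count as the start of a commit.
print(bool(re.match(commit_start_rgx, real_header)))  # True
print(bool(re.match(commit_start_rgx, pasted_line)))  # False
```

Since re.match already anchors at the start of the string, the ^ in the pattern is redundant here, though harmless.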

The final parser:

blogplish.py commit_id = f09766

...


def parse_git_log_info(text_output):
    """ returns a commits_array like:

        [
            {'commit_id': '23hj3sz...', 'message': 'cleanup cruft'},
            {'commit_id': 'df8dje...', 'message': 'Changed paypal api setting to...'},
            ...
        ]
    """
    commit_start_rgx = r"^commit (?P<commit_id>\w{40})"
    lines = text_output.split('\n')
    commits_array = []
    current_commit_id = None
    current_commit_message_string = ""

    for line in lines:
        match = re.match(commit_start_rgx, line)
        if match:
            # this if block fails only once, on the first pass through
            if current_commit_id:
                commits_array.append({'commit_id': current_commit_id, 'message': current_commit_message_string.strip()})
            current_commit_id = match.group('commit_id')
            current_commit_message_string = ""
        else:
            if not line.startswith('Author: ') and not line.startswith('Date: '):
                current_commit_message_string += line

    return commits_array


output, error = call_sp('git log')

print(parse_git_log_info(output))

This parser goes line by line, checking if the line starts a new commit block or not:

match = re.match(commit_start_rgx, line)

If not, the parser adds the line to the commit message if applicable (if it doesn't start with 'commit', 'Author: ', or 'Date: '). If the line does match "^commit (?P<commit_id>\w{40})", it will add the data to the final results if the data is ready (except on the first go around, where we have current_commit_id initialized to None).

While it isn't as clean looking as smaller parsers, I always find this line-by-line style to be less error prone for tricky text parsing.
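Two details of the parser above are worth knowing. A commit is only appended when the next commit line is seen, so the last commit in the log (the oldest one) is never appended, and message lines are joined without newlines. A hedged sketch of a variant that handles both, tested on fake log text; parse_git_log_text is a hypothetical name, not the function in the repo:

```python
import re

COMMIT_START = re.compile(r"^commit (?P<commit_id>\w{40})")


def parse_git_log_text(text_output):
    """Hypothetical variant of parse_git_log_info that keeps the last commit."""
    commits = []
    current_id = None
    message_lines = []

    def flush():
        # Store the commit collected so far, if there is one.
        if current_id:
            message = "\n".join(message_lines).strip()
            commits.append({'commit_id': current_id, 'message': message})

    for line in text_output.split('\n'):
        match = COMMIT_START.match(line)
        if match:
            flush()
            current_id = match.group('commit_id')
            message_lines = []
        elif not line.startswith(('Author: ', 'Date: ')):
            message_lines.append(line)

    flush()  # the final commit has no following header to trigger a flush
    return commits


sample_log = (
    "commit " + "a" * 40 + "\nAuthor: A <a@example.com>\n"
    "Date:   Sun Jul 30 2017\n\n    newer message\n\n"
    "commit " + "b" * 40 + "\nAuthor: B <b@example.com>\n"
    "Date:   Sat Jul 29 2017\n\n    older message\n"
)
parsed = parse_git_log_text(sample_log)
print(parsed)
```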

Now, let's work on the Add entire files that were changed to final string part. To do this, we want to first find the files that were changed in each commit:

blogplish.py commit_id = 214edf

...

"""
Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""

...

def get_files_that_were_changed_in_commit(commit_id):
    # "get files that were changed in a commit": https://stackoverflow.com/questions/424071/how-to-list-all-the-files-in-a-commit
    output, error = call_sp('git diff-tree --no-commit-id --name-only -r %s' % commit_id)
    if error:
        raise Exception("Error in get_files_that_were_changed_in_commit():\n\n" + error)
    return output.split('\n')


output, error = call_sp('git log')

parsed_commits = parse_git_log_info(output)

first_commit = parsed_commits[0]
first_commit_id = first_commit['commit_id']

changed_files = get_files_that_were_changed_in_commit(first_commit_id)
print(changed_files)

We have a small issue however, as the output is ['blogplish.py', '']. We can prune empty lines out of our result using a list comprehension:

blogplish.py commit_id = 1b9bb4


def get_files_that_were_changed_in_commit(commit_id):
    # "get files that were changed in a commit": https://stackoverflow.com/questions/424071/how-to-list-all-the-files-in-a-commit
    output, error = call_sp('git diff-tree --no-commit-id --name-only -r %s' % commit_id)
    if error:
        raise Exception("Error in get_files_that_were_changed_in_commit():\n\n" + error)
    changed_files_intermediary = output.split('\n')
    # at first got a result like ['blogplish.py', '']
    changed_files = [this_file for this_file in changed_files_intermediary if this_file]
    return changed_files

...

The [this_file for this_file in changed_files_intermediary if this_file] prunes out empty strings.
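Here is the same pruning on a hard-coded string, next to str.splitlines() as a possible alternative. splitlines() does not produce the trailing empty string, although blank lines in the middle of the text would still need the filter:

```python
raw_output = 'blogplish.py\n'

# split('\n') leaves an empty string after the final newline.
with_empty = raw_output.split('\n')

# The list comprehension from the post drops empty strings.
pruned = [this_file for this_file in with_empty if this_file]

# splitlines() never adds that trailing empty item.
via_splitlines = raw_output.splitlines()

print(with_empty)      # ['blogplish.py', '']
print(pruned)          # ['blogplish.py']
print(via_splitlines)  # ['blogplish.py']
```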

Now that we know which files were changed in any commit, we need to get the contents of the file at that point in time:

blogplish.py commit_id = f66b7b


...

def get_files_that_were_changed_in_commit(commit_id):
    ...


def get_contents_of_certain_file_in_certain_commit(commit_id, filename):
    # "get contents of a certain file in a commit": https://stackoverflow.com/questions/2497051/how-can-i-show-the-contents-of-a-file-at-a-specific-state-of-a-git-repo
    output, error = call_sp('git show %s:%s' % (commit_id, filename))
    if error:
        raise Exception("Error in get_contents_of_certain_file_in_certain_commit():\n\n" + error)
    return output


output, error = call_sp('git log')

parsed_commits = parse_git_log_info(output)

first_commit = parsed_commits[0]
first_commit_id = first_commit['commit_id']

changed_files = get_files_that_were_changed_in_commit(first_commit_id)

for changed_file in changed_files:
    contents = get_contents_of_certain_file_in_certain_commit(first_commit_id, changed_file)
    print(contents)

Now run python blogplish.py.

To double-check this, we used the first commit of the blogplish project:

blogplish.py commit_id = c4b7c7

...

output, error = call_sp('git log')

parsed_commits = parse_git_log_info(output)

first_commit = parsed_commits[0]
first_commit_id = first_commit['commit_id']

changed_files = get_files_that_were_changed_in_commit(first_commit_id)

# for changed_file in changed_files:
#     contents = get_contents_of_certain_file_in_certain_commit(first_commit_id, changed_file)
#     print(contents)

print(get_contents_of_certain_file_in_certain_commit('b37ae0371d1', 'blogplish.py'))

And got:

cchilders:~/blogplish (master)    
$ python blogplish.py        
print("The script is working.")    

It's working.

To get the diff of a file at a certain point in time, we use git diff {older_commit_id}..{newer_commit_id} {filename} syntax. Check out get_diff_of_certain_file_in_certain_commit():

blogplish.py commit_id = 7ff0eb

import re
import sys
from subprocess import Popen, PIPE

THIS_SCRIPT_NAME = sys.argv[0]

...


def get_contents_of_certain_file_in_certain_commit(commit_id, filename):
    ...


def get_diff_of_certain_file_in_certain_commit(newer_commit_id, older_commit_id, filename):
    # "get dif of a certain file in certain commit": https://stackoverflow.com/questions/42357521/generate-diff-file-of-a-specific-commit-in-git
    command = 'git diff {older_commit_id}..{newer_commit_id} {filename}'.format(older_commit_id=older_commit_id, newer_commit_id=newer_commit_id, filename=filename)
    raw_diff, error = call_sp(command)
    if error:
        raise Exception("Error in get_diff_of_certain_file_in_certain_commit():\n\n" + error)
    return raw_diff


a_diff_2_commits_back = get_diff_of_certain_file_in_certain_commit('c4b7c7cabccc350eef5ef80344f', 'f66b7bfd0f82d5b987d9f71f', THIS_SCRIPT_NAME)
print(a_diff_2_commits_back)

Run the script again. This manual check shows it's working.
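Because the diff function takes two commit ids, it is easy to pass them in the wrong order and get a reversed diff. One possible safeguard, sketched here with hypothetical helper names that are not in the repo, is to build the git arguments with keyword-only parameters. The lists returned are meant for Popen without shell=True, which is an assumption that differs from how call_sp runs commands:

```python
def git_show_args(commit_id, filename):
    """Hypothetical helper: arguments for `git show <commit>:<file>`."""
    return ['git', 'show', '{}:{}'.format(commit_id, filename)]


def git_diff_args(*, older_commit_id, newer_commit_id, filename):
    """Hypothetical helper: arguments for `git diff <older>..<newer> -- <file>`.

    The bare * forces callers to name each commit id, so the order
    of the ids cannot be mixed up silently.
    """
    commit_range = '{}..{}'.format(older_commit_id, newer_commit_id)
    return ['git', 'diff', commit_range, '--', filename]


show_args = git_show_args('b37ae0', 'blogplish.py')
diff_args = git_diff_args(
    newer_commit_id='c4b7c7',
    older_commit_id='f66b7b',
    filename='blogplish.py',
)
print(show_args)
print(diff_args)
```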

We're finally ready to combine these 3 functions into an autogenerated markdown file for our blogpost. We started with this func and reviewed the commits data we first got:

blogplish.py commit_id = 07941b



def get_contents_of_certain_file_in_certain_commit(commit_id, filename):
    ...

def get_diff_of_certain_file_in_certain_commit(newer_commit_id, older_commit_id, filename):
   ...


def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    print(parsed_commits)

    # first_commit = parsed_commits[0]
    # first_commit_id = first_commit['commit_id']
    #
    # changed_files = get_files_that_were_changed_in_commit(first_commit_id)
    #
    # for changed_file in changed_files:
    #     contents = get_contents_of_certain_file_in_certain_commit(first_commit_id, changed_file)
    #     print(contents)
    #
    # print(get_contents_of_certain_file_in_certain_commit('b37ae0371d1', 'blogplish.py'))
    #
    # a_diff_2_commits_back = get_diff_of_certain_file_in_certain_commit('c4b7c7cabccc350eef5ef80344f', 'f66b7bfd0f82d5b987d9f71f', THIS_SCRIPT_NAME)
    # print(a_diff_2_commits_back)


auto_blogplish_blog()

While the commits come back in order of newest to oldest, we write tutorials from start to finish, so the order is backwards. Reversing a list in python is very easy:

blogplish.py commit_id = 27635c

...


def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    # "reverse a list python": https://stackoverflow.com/questions/3940128/how-can-i-reverse-a-list-in-python
    parsed_commits.reverse()
    print(parsed_commits)

    ...


auto_blogplish_blog()
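For reference, list.reverse() changes the list in place and returns None. If you want to keep the original order around, reversed() or slicing gives you a reversed copy instead. A sketch with placeholder commit data:

```python
commits = [
    {'commit_id': 'newest'},
    {'commit_id': 'middle'},
    {'commit_id': 'oldest'},
]

# Both of these build a new list and leave `commits` untouched.
copy_via_reversed = list(reversed(commits))
copy_via_slice = commits[::-1]

# This one reorders `commits` itself and returns None.
result = commits.reverse()

print([c['commit_id'] for c in copy_via_reversed])
print([c['commit_id'] for c in commits])
print(result)
```

git log also accepts a --reverse flag that prints the oldest commit first, which may be an alternative to reversing in Python.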

Now we want to start iterating over the commit data, generating the text. The order will go

  1. commit message
  2. the diff of each file that was changed
  3. the total contents of each file that was changed

First, get the commit messages added in the correct order:

blogplish.py commit_id = 1922b8

import re
import sys
from subprocess import Popen, PIPE

THIS_SCRIPT_NAME = sys.argv[0]


"""
Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""

def call_sp(command, *args, **kwargs):
    """ you can run command from any directory you want by passing in a kwarg of 'cwd' (current working directory):

        call_sp('ls -a', cwd='/home/username/projects/awesomeproject')
    """
    if args:
        command = command.format(*args)
    p = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
    output, err = p.communicate()
    return output, err


def parse_git_log_info(text_output):
    """ returns a commits_array like:

        [
            {'commit_id': '23hj3sz...', 'message': 'cleanup cruft'},
            {'commit_id': 'df8dje...', 'message': 'Changed paypal api setting to...'},
            ...
        ]
    """
    commit_start_rgx = r"^commit (?P<commit_id>\w{40})"
    lines = text_output.split('\n')
    commits_array = []
    current_commit_id = None
    current_commit_message_string = ""

    for line in lines:
        match = re.match(commit_start_rgx, line)
        if match:
            # this if block fails only once, on the first pass through
            if current_commit_id:
                commits_array.append({'commit_id': current_commit_id, 'message': current_commit_message_string.strip()})
            current_commit_id = match.group('commit_id')
            current_commit_message_string = ""
        else:
            if not line.startswith('Author: ') and not line.startswith('Date: '):
                current_commit_message_string += line

    return commits_array


def get_files_that_were_changed_in_commit(commit_id):
    # "get files that were changed in a commit": https://stackoverflow.com/questions/424071/how-to-list-all-the-files-in-a-commit
    output, error = call_sp('git diff-tree --no-commit-id --name-only -r %s' % commit_id)
    if error:
        raise Exception("Error in get_files_that_were_changed_in_commit():\n\n" + error)
    changed_files_intermediary = output.split('\n')
    # at first got a result like ['blogplish.py', '']
    changed_files = [this_file for this_file in changed_files_intermediary if this_file]
    return changed_files


def get_contents_of_certain_file_in_certain_commit(commit_id, filename):
    # "get contents of a certain file in a commit": https://stackoverflow.com/questions/2497051/how-can-i-show-the-contents-of-a-file-at-a-specific-state-of-a-git-repo
    output, error = call_sp('git show %s:%s' % (commit_id, filename))
    if error:
        raise Exception("Error in get_contents_of_certain_file_in_certain_commit():\n\n" + error)
    return output


def get_diff_of_certain_file_in_certain_commit(newer_commit_id, older_commit_id, filename):
    # "get dif of a certain file in certain commit": https://stackoverflow.com/questions/42357521/generate-diff-file-of-a-specific-commit-in-git
    command = 'git diff {older_commit_id}..{newer_commit_id} {filename}'.format(older_commit_id=older_commit_id, newer_commit_id=newer_commit_id, filename=filename)
    raw_diff, error = call_sp(command)
    if error:
        raise Exception("Error in get_diff_of_certain_file_in_certain_commit():\n\n" + error)
    return raw_diff


def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    # "reverse a list python": https://stackoverflow.com/questions/3940128/how-can-i-reverse-a-list-in-python
    parsed_commits.reverse()

    first_commit = parsed_commits[0]
    first_commit_id = first_commit['commit_id']

    for index, commit_data in enumerate(parsed_commits):
        blog_post += commit_data['message']
        blog_post += '\n\n\n\n'

    # changed_files = get_files_that_were_changed_in_commit(first_commit_id)

    # for changed_file in changed_files:
    #     contents = get_contents_of_certain_file_in_certain_commit(first_commit_id, changed_file)
    #     print(contents)

    # print(get_contents_of_certain_file_in_certain_commit('b37ae0371d1', 'blogplish.py'))

    # a_diff_2_commits_back = get_diff_of_certain_file_in_certain_commit('c4b7c7cabccc350eef5ef80344f', 'f66b7bfd0f82d5b987d9f71f', THIS_SCRIPT_NAME)
    # print(a_diff_2_commits_back)

    return blog_post


blog_text = auto_blogplish_blog()
print(blog_text)

We did step 3, add the total contents, second:

blogplish.py commit_id = 2a4a43

...


def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    # "reverse a list python": https://stackoverflow.com/questions/3940128/how-can-i-reverse-a-list-in-python
    parsed_commits.reverse()

    for index, commit_data in enumerate(parsed_commits):
        blog_post += commit_data['message']
        blog_post += '\n\n\n\n'
        commit_id = commit_data['commit_id']

        changed_files = get_files_that_were_changed_in_commit(commit_id)
        if changed_files:
            blog_post += '$$$ Entire contents of changed files: $$$\n\n'
        for changed_file in changed_files:
            contents = get_contents_of_certain_file_in_certain_commit(commit_id, changed_file)
            blog_post += '## ' + changed_file + ': ##\n\n'
            blog_post += contents
            blog_post += '\n\n\n\n'

    # print(get_contents_of_certain_file_in_certain_commit('b37ae0371d1', 'blogplish.py'))

    # a_diff_2_commits_back = get_diff_of_certain_file_in_certain_commit('c4b7c7cabccc350eef5ef80344f', 'f66b7bfd0f82d5b987d9f71f', THIS_SCRIPT_NAME)
    # print(a_diff_2_commits_back)

    return blog_post


blog_text = auto_blogplish_blog()
print(blog_text)

Now step 2, adding the diff of each file that was changed:

blogplish.py commit_id = f000c4

...

def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    # "reverse a list python": https://stackoverflow.com/questions/3940128/how-can-i-reverse-a-list-in-python
    parsed_commits.reverse()

    for index, commit_data in enumerate(parsed_commits):
        blog_post += commit_data['message']
        blog_post += '\n\n\n\n'
        this_commit_id = commit_data['commit_id']

        changed_files = get_files_that_were_changed_in_commit(this_commit_id)

        if changed_files:
            if index > 0:
                blog_post += '$$$ Diffs of changed files: $$$\n\n'
                for changed_file in changed_files:
                    older_commit_id = parsed_commits[index - 1]['commit_id']
                    # the signature is (newer_commit_id, older_commit_id, filename)
                    this_diff = get_diff_of_certain_file_in_certain_commit(this_commit_id, older_commit_id, changed_file)
                    blog_post += '## ' + changed_file + ': ##\n\n'
                    blog_post += this_diff
                    blog_post += '\n\n\n\n'

            blog_post += '$$$ Entire contents of changed files: $$$\n\n'
            for changed_file in changed_files:
                contents = get_contents_of_certain_file_in_certain_commit(this_commit_id, changed_file)
                blog_post += '## ' + changed_file + ': ##\n\n'
                blog_post += contents
                blog_post += '\n\n\n\n'

    return blog_post


blog_text = auto_blogplish_blog()
print(blog_text)
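Since the goal is markdown that can be pasted into a post, one small improvement would be to wrap each file's contents in a code fence. A minimal sketch, assuming a hypothetical helper named fence_code that is not part of blogplish:

```python
def fence_code(contents, language="python"):
    """Hypothetical helper: wrap file contents in a markdown code fence."""
    fence = "`" * 3  # three backticks
    body = contents.rstrip("\n")
    return "{0}{1}\n{2}\n{0}\n".format(fence, language, body)


fenced = fence_code('print("The script is working.")\n')
print(fenced)
```

In auto_blogplish_blog(), this could be applied to contents and this_diff before they are added to blog_post.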

The output is still rough, and the diffs printed out are hard to read. It could use a CLI to take you through each commit and each file, letting the author pick how to show the changes in the blog. A JavaScript UI might be much easier than a CLI, as you can click what to keep and edit text in place. Overall, in 1 day, after work, with no beer or caffeine in the house, I'd say Servando and I did pretty good.

blogplish.py commit_id = f000c4

import re
import sys
from subprocess import Popen, PIPE

THIS_SCRIPT_NAME = sys.argv[0]


"""
Get all commit info

For each commit in the commit info:
    Add commit message to a final string
    Add changes to final string
    Add entire files that were changed to final string
"""

def call_sp(command, *args, **kwargs):
    """ you can run command from any directory you want by passing in a kwarg of 'cwd' (current working directory):

        call_sp('ls -a', cwd='/home/username/projects/awesomeproject')
    """
    if args:
        command = command.format(*args)
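    # universal_newlines=True makes communicate() return text rather than bytes on Python 3
    p = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE, universal_newlines=True, **kwargs)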
    output, err = p.communicate()
    return output, err


def parse_git_log_info(text_output):
    """ returns a commits_array like:

        [
            {'commit_id': '23hj3sz...', 'message': 'cleanup cruft'},
            {'commit_id': 'df8dje...', 'message': 'Changed paypal api setting to...'},
            ...
        ]
    """
    commit_start_rgx = r"^commit (?P<commit_id>\w{40})"
    lines = text_output.split('\n')
    commits_array = []
    current_commit_id = None
    current_commit_message_string = ""

    for line in lines:
        match = re.match(commit_start_rgx, line)
        if match:
            # this if block fails only once, on the first pass through
            if current_commit_id:
                commits_array.append({'commit_id': current_commit_id, 'message': current_commit_message_string.strip()})
            current_commit_id = match.group('commit_id')
            current_commit_message_string = ""
        else:
            if not line.startswith('Author: ') and not line.startswith('Date: '):
                current_commit_message_string += line

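    # the loop only appends a commit when it reaches the next one,
    # so the last commit in the log (the oldest) has to be added here
    if current_commit_id:
        commits_array.append({'commit_id': current_commit_id, 'message': current_commit_message_string.strip()})

    return commits_array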


def get_files_that_were_changed_in_commit(commit_id):
    # "get files that were changed in a commit": https://stackoverflow.com/questions/424071/how-to-list-all-the-files-in-a-commit
    output, error = call_sp('git diff-tree --no-commit-id --name-only -r %s' % commit_id)
    if error:
        raise Exception("Error in get_files_that_were_changed_in_commit():\n\n" + error)
    changed_files_intermediary = output.split('\n')
    # at first got a result like ['blogplish.py', '']
    changed_files = [this_file for this_file in changed_files_intermediary if this_file]
    return changed_files


def get_contents_of_certain_file_in_certain_commit(commit_id, filename):
    # "get contents of a certain file in a commit": https://stackoverflow.com/questions/2497051/how-can-i-show-the-contents-of-a-file-at-a-specific-state-of-a-git-repo
    output, error = call_sp('git show %s:%s' % (commit_id, filename))
    if error:
        raise Exception("Error in get_contents_of_certain_file_in_certain_commit():\n\n" + error)
    return output


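def get_diff_of_certain_file_in_certain_commit(older_commit_id, newer_commit_id, filename):
    """
    returns the raw `git diff` output for filename between two commits

    the argument order matches the call in auto_blogplish_blog(), which
    passes the older commit first, so the diff reads old -> new
    """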
    # "get dif of a certain file in certain commit": https://stackoverflow.com/questions/42357521/generate-diff-file-of-a-specific-commit-in-git
    command = 'git diff {older_commit_id}..{newer_commit_id} {filename}'.format(older_commit_id=older_commit_id, newer_commit_id=newer_commit_id, filename=filename)
    raw_diff, error = call_sp(command)
    if error:
        raise Exception("Error in get_diff_of_certain_file_in_certain_commit():\n\n" + error)
    return raw_diff


def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    # "reverse a list python": https://stackoverflow.com/questions/3940128/how-can-i-reverse-a-list-in-python
    parsed_commits.reverse()

    for index, commit_data in enumerate(parsed_commits):
        blog_post += commit_data['message']
        blog_post += '\n\n\n\n'
        this_commit_id = commit_data['commit_id']

        changed_files = get_files_that_were_changed_in_commit(this_commit_id)

        if changed_files:
            if index > 0:
                blog_post += '$$$ Diffs of changed files: $$$\n\n'
                for changed_file in changed_files:
                    older_commit_id = parsed_commits[index - 1]['commit_id']
                    this_diff = get_diff_of_certain_file_in_certain_commit(older_commit_id, this_commit_id, changed_file)
                    blog_post += '## ' + changed_file + ': ##\n\n'
                    blog_post += this_diff
                    blog_post += '\n\n\n\n'

            blog_post += '$$$ Entire contents of changed files: $$$\n\n'
            for changed_file in changed_files:
                contents = get_contents_of_certain_file_in_certain_commit(this_commit_id, changed_file)
                blog_post += '## ' + changed_file + ': ##\n\n'
                blog_post += contents
                blog_post += '\n\n\n\n'

    return blog_post


blog_text = auto_blogplish_blog()
print(blog_text)
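
The line-by-line regex walk in `parse_git_log_info` works on git's default, human-readable log. An alternative, sketched here rather than taken from blogplish, is to ask git for a machine-friendly layout with `--pretty=format:` and split on separator bytes. The function names and the choice of separators (`%x1f` between fields, `%x1e` after each commit) are assumptions for illustration. `--reverse` makes git list the oldest commit first, so the list does not need reversing afterwards.

```python
from subprocess import check_output

# %H = full commit hash, %B = raw commit message,
# %x1f / %x1e = unit and record separator bytes used as delimiters.
LOG_COMMAND = ['git', 'log', '--reverse', '--pretty=format:%H%x1f%B%x1e']


def parse_separated_log(text_output):
    """Turn the output of LOG_COMMAND into [{'commit_id': ..., 'message': ...}, ...]."""
    commits = []
    for record in text_output.split('\x1e'):
        # git puts a newline between records; strip only newlines here,
        # because Python also treats the separator bytes as whitespace.
        record = record.strip('\n')
        if not record:
            continue  # skip the empty piece after the final separator
        commit_id, message = record.split('\x1f', 1)
        commits.append({'commit_id': commit_id, 'message': message.strip()})
    return commits


def read_commits(repo_path='.'):
    """Hypothetical wrapper: run git inside repo_path and parse the result."""
    raw = check_output(LOG_COMMAND, cwd=repo_path, universal_newlines=True)
    return parse_separated_log(raw)


# Made-up log text in the shape LOG_COMMAND produces, so no repository is needed.
sample = 'a' * 40 + '\x1fFirst commit\n\x1e\n' + 'b' * 40 + '\x1fSecond commit\n\nwith a body\n\x1e'
print(parse_separated_log(sample))
```

Because every commit ends with an explicit separator, multi-line messages keep their line breaks, and the parser does not have to guess which lines are `Author:` or `Date:` headers.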

Before autogenerating the post you're reading for publishing, we commented out the lines that show the diffs. The output also lacked some autoformatting, such as code blocks fenced in ``` and nicely formatted filenames, so this commit adds both:

blogplish.py commit_id = 5f8c6a

...

def auto_blogplish_blog():
    blog_post = ""

    output, error = call_sp('git log')

    parsed_commits = parse_git_log_info(output)
    # "reverse a list python": https://stackoverflow.com/questions/3940128/how-can-i-reverse-a-list-in-python
    parsed_commits.reverse()

    for index, commit_data in enumerate(parsed_commits):
        blog_post += commit_data['message']
        blog_post += '\n\n\n\n'
        this_commit_id = commit_data['commit_id']

        changed_files = get_files_that_were_changed_in_commit(this_commit_id)

        if changed_files:
            # if index > 0:
            #     blog_post += '$$$ Diffs of changed files: $$$\n\n'
            #     for changed_file in changed_files:
            #         older_commit_id = parsed_commits[index - 1]['commit_id']
            #         this_diff = get_diff_of_certain_file_in_certain_commit(older_commit_id, this_commit_id, changed_file)
            #         blog_post += '## ' + changed_file + ': ##\n\n'
            #         blog_post += this_diff
            #         blog_post += '\n\n\n\n'

            # blog_post += '$$$ Entire contents of changed files: $$$\n\n'
            for changed_file in changed_files:
                contents = get_contents_of_certain_file_in_certain_commit(this_commit_id, changed_file)
                # blog_post += '## ' + changed_file + ': ##\n\n'
                blog_post += '`' + changed_file + '`\n\n'
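                # the opening fence needs its own line, or the first line of the file is read as the block's info string
                blog_post += "```\n" + contents.rstrip('\n') + "\n```"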
                blog_post += '\n\n\n\n'

    return blog_post


blog_text = auto_blogplish_blog()
print(blog_text)
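
A fixed three-backtick fence can break if a changed file itself contains a line made of three or more backticks, as a README or a markdown draft might. A common workaround, sketched below, is to pick a fence one backtick longer than the longest run found in the contents. `fence_code` is a hypothetical helper, not a function from blogplish.

```python
import re


def fence_code(contents, language=''):
    """Hypothetical helper: wrap contents in a fence longer than any backtick run inside it."""
    runs = re.findall(r'`+', contents)
    longest_run = max(len(run) for run in runs) if runs else 0
    fence = '`' * max(3, longest_run + 1)
    # The opening fence gets its own line so the first line of code is not
    # swallowed as the block's info string.
    return fence + language + '\n' + contents.rstrip('\n') + '\n' + fence + '\n'


print(fence_code('print("plain python")', 'python'))
print(fence_code('Some docs\n```\nnested block\n```\n'))
```

The second call shows the case that matters: the contents already hold a three-backtick block, so the helper switches to four backticks for the outer fence.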

Almost done. Now just add a function to write this out to a file:

blogplish.py commit_id = 4b9100

def write_content(the_file, content):
    with open(the_file, 'w') as f:
        f.write(content)

and use it at the bottom of the file:

blog_text = auto_blogplish_blog()

write_content('blog_rough_draft.md', blog_text)
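
If commit messages or file contents ever include non-ASCII characters, the plain `open(the_file, 'w')` call depends on the interpreter's default encoding. Below is a minimal sketch of a variant that pins the encoding, assuming UTF-8 is what you want. `write_content_utf8` is a hypothetical name, and `io.open` is used because it accepts `encoding` on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_content_utf8(the_file, content):
    """Hypothetical variant of write_content that always writes UTF-8 text."""
    if isinstance(content, bytes):
        # e.g. raw subprocess output; assumed here to be UTF-8 encoded
        content = content.decode('utf-8')
    with io.open(the_file, 'w', encoding='utf-8') as f:
        f.write(content)


# Write into a temporary directory so the example does not touch a real project.
draft_path = os.path.join(tempfile.mkdtemp(), 'blog_rough_draft.md')
write_content_utf8(draft_path, u'caf\u00e9 commit message\n')
print('draft written to ' + draft_path)
```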

Apart from the edits, the intro paragraph, and these last two paragraphs, this entire post was autogenerated once the first stage of blogplish was finished; the post itself was written in the commit messages as we built blogplish. This approach forces us to stop writing lazy commit messages like "changed setting" or my one-word, scumbag favorite on personal projects, "updates". It can also save loads of time later, when publishing tutorials or helping new teammates at work understand how a system was built.

We hope you enjoyed the blogplishness.

Discover and read more posts from Cody Childers