Changes from 3 commits
18 changes: 18 additions & 0 deletions samples/amcache.py
@@ -15,10 +15,13 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function
Owner
i always forget to do this, thanks!

import sys
import logging
import datetime
from collections import namedtuple
import json
Owner
style nitpick: i like to order my imports by line length (within groups of stdlib, pip-installable, and project-local modules).


import argparse
import unicodecsv
@@ -175,6 +178,8 @@ def main(argv=None):
                        help="Enable verbose output")
    parser.add_argument("-t", action="store_true", dest="do_timeline",
                        help="Output in simple timeline format")
    parser.add_argument("-j", action="store_true", dest="do_json",
                        help="Output in JSON-formatted strings")
    args = parser.parse_args(argv[1:])

    if args.verbose:
@@ -213,6 +218,19 @@ def main(argv=None):
        w.writerow(["timestamp", "timestamp_type", "path", "sha1"])
        for e in sorted(entries, key=lambda e: e.timestamp):
            w.writerow([e.timestamp, e.type, e.entry.path, e.entry.sha1])

Owner
we should probably add a check that -j and -t are not both provided. i assume the user would figure this out pretty quickly on their own, but it's best to be explicit.
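One way to make that check explicit is to let argparse enforce it. The following is only a sketch: `build_parser` is a hypothetical helper (the script under review builds its parser inside `main`), and only the two flags from this diff are shown.

```python
import argparse


def build_parser():
    # Hypothetical helper for illustration; not part of the script under review.
    parser = argparse.ArgumentParser(description="sketch of exclusive output modes")
    # Arguments in a mutually exclusive group cannot be combined on one command line.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-t", action="store_true", dest="do_timeline",
                       help="Output in simple timeline format")
    group.add_argument("-j", action="store_true", dest="do_json",
                       help="Output in JSON-formatted strings")
    return parser


parser = build_parser()
args = parser.parse_args(["-j"])
# Passing both flags, e.g. ["-j", "-t"], makes argparse report an error and exit.
```

With this approach the usage message also shows the two flags as alternatives, so no hand-written check is needed after `parse_args`.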

    elif args.do_json:
        for e in ee:
            document = {}
            for i in FIELDS:
                document[i.name] = getattr(e, i.name)
                if document[i.name] is None:
                    document[i.name] = "-"
                elif type(document[i.name]) == datetime.datetime:
Owner
recommend using isinstance(val, datetime.datetime) over type(...) == .... the former works more intuitively when dealing with inheritance (doesn't affect us here, but i like to be consistent).
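A small illustration of the inheritance point, using a made-up subclass (`TaggedDatetime` is hypothetical and does not appear in the project):

```python
import datetime


class TaggedDatetime(datetime.datetime):
    """Hypothetical subclass, used only to show the difference."""


val = TaggedDatetime(2015, 1, 2, 3, 4, 5)

# An exact type comparison rejects subclasses.
exact_match = type(val) == datetime.datetime
# isinstance accepts the class and anything derived from it.
instance_match = isinstance(val, datetime.datetime)
```

Here `exact_match` is `False` while `instance_match` is `True`, so only the `isinstance` check would still format a subclassed timestamp.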

                    document[i.name] = str(document[i.name])
Owner
the logic here makes sense, though i'd recommend using a temporary variable rather than indexing into the document dict so many times. perhaps something like:

                val = getattr(e, i.name)
                if val is None:
                    val = "-"
                elif type(val) == datetime.datetime:
                    val = str(val)
                document[i.name] = val

Owner
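taking a look at the python documentation, we can provide a default argument to getattr and simplify the code (caveat: the default is only used when the attribute is missing, not when it is set to None, so this assumes empty fields are absent):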

                val = getattr(e, i.name, "-")
                if isinstance(val, datetime.datetime):
                    val = str(val)
                document[i.name] = val
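One thing worth verifying before adopting the `getattr` default: it applies only when the attribute does not exist, not when the attribute exists and holds `None`. A minimal sketch, assuming the entries are namedtuples whose empty fields are set to `None` (`Entry` below is a hypothetical stand-in, not the project's type):

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed entry; field names are illustrative.
Entry = namedtuple("Entry", ["path", "sha1"])
e = Entry(path="C:\\tool.exe", sha1=None)

# The attribute exists, so the default is ignored and None comes back.
with_default = getattr(e, "sha1", "-")
# The default is used only for a name that is not an attribute at all.
missing = getattr(e, "no_such_field", "-")

# Covering both cases still needs an explicit None check.
val = getattr(e, "sha1", None)
if val is None:
    val = "-"
```

If the fields can be `None`, keeping the `is None` branch preserves the original behaviour of writing "-" for empty values.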

Owner
finally, we should probably be explicit in our timestamp formatting, so recommend using:

                val = getattr(e, i.name, "-")
                if isinstance(val, datetime.datetime):
                    val = val.isoformat(" ")
                document[i.name] = val
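For reference, a quick comparison of the formatting calls involved, using an arbitrary sample timestamp:

```python
import datetime

ts = datetime.datetime(2015, 1, 2, 3, 4, 5)  # arbitrary sample value

implicit = str(ts)              # relies on datetime's default string form
spaced = ts.isoformat(" ")      # explicit, with a space separator
iso_default = ts.isoformat()    # explicit, with the default "T" separator
```

`str(ts)` and `ts.isoformat(" ")` produce the same text, so the change is about stating the format explicitly rather than changing the output; `isoformat()` without an argument would switch the separator to "T".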


            print(json.dumps(document, ensure_ascii=False).encode("utf-8"))
Owner
the unicode handling here looks good to me. nice work! json.dumps returns either str or unicode, which you correctly encode into a specific representation.

on windows, i've occasionally hit issues where stdout is open in text mode, so when writing binary data to stdout, the windows shell inserts some unexpected bytes. however, since here we're dealing with encoded text, i think we should be ok.
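If the script ever needs to run on Python 3 as well (the `print_function` import suggests that may matter), printing a `bytes` object would show its `b'...'` repr instead of the text. A possible sketch that writes the encoded line to a binary stream instead; `emit` is a hypothetical helper, and the stream is assumed to be a binary file-like object:

```python
import io
import json


def emit(document, stream):
    """Write one UTF-8 encoded JSON line to a binary stream (hypothetical helper)."""
    line = json.dumps(document, ensure_ascii=False)
    # json.dumps may already return a byte string on Python 2; encode only text.
    if not isinstance(line, bytes):
        line = line.encode("utf-8")
    stream.write(line + b"\n")


buf = io.BytesIO()
emit({"path": u"caf\u00e9"}, buf)
```

For real output, the binary stream could be `sys.stdout.buffer` on Python 3 or `sys.stdout` itself on Python 2.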

Owner
note, the output format of the -j mode is not json (a single document), but a collection of json documents. i've seen this called jsonl before (http://jsonlines.org/).

making the output a single document makes it easier for most programs to ingest, but more difficult to process as a stream. jsonl works a bit better for processing incrementally. what format do you think we should use?
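To make the trade-off concrete, here is a rough sketch of both shapes. The records are made up, and `to_jsonl` and `to_single_document` are hypothetical helper names:

```python
import json

# Made-up records standing in for the per-entry documents.
records = [
    {"path": "a.exe", "sha1": "-"},
    {"path": "b.exe", "sha1": "-"},
]


def to_jsonl(docs):
    # One JSON document per line; a consumer can parse each line as it arrives.
    return "".join(json.dumps(d, sort_keys=True) + "\n" for d in docs)


def to_single_document(docs):
    # One JSON array; a consumer needs the whole text before parsing.
    return json.dumps(list(docs), sort_keys=True)


jsonl_text = to_jsonl(records)
array_text = to_single_document(records)

# Incremental consumption of the JSONL form, line by line.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```

Both forms carry the same data; the JSONL form can be written and read one entry at a time, while the array form loads with a single `json.loads` call.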


    else:
        w = unicodecsv.writer(sys.stdout, delimiter="|", quotechar="\"",
                              quoting=unicodecsv.QUOTE_MINIMAL, encoding="utf-8")