#archiveteam-bs 2018-01-14,Sun

Time	Nickname	Message
00:26 ^🔗		BlueMaxim has joined #archiveteam-bs
01:16 ^🔗		godane has quit IRC (Read error: Operation timed out)
01:19 ^🔗		godane has joined #archiveteam-bs
02:22 ^🔗		username1 has joined #archiveteam-bs
02:32 ^🔗		schbirid2 has quit IRC (Read error: Operation timed out)
03:56 ^🔗		Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
04:32 ^🔗		Odd0002 has joined #archiveteam-bs
04:43 ^🔗		qw3rty17 has joined #archiveteam-bs
04:48 ^🔗		qw3rty16 has quit IRC (Read error: Operation timed out)
05:02 ^🔗		octothorp has quit IRC (Read error: Connection reset by peer)
05:03 ^🔗		octothorp has joined #archiveteam-bs
08:03 ^🔗		phirephly has joined #archiveteam-bs
10:42 ^🔗		tuluu has quit IRC (Ping timeout: 260 seconds)
10:53 ^🔗	jacketcha	Does anybody know how to archive DMs on instagram and twitter, or password protected pages?
11:59 ^🔗	username1	for what purpose?
12:12 ^🔗	jacketcha	Well, if I'm going to get nuked, no point in not dumping every ounce of data you own
12:13 ^🔗	jacketcha	Also because it's in my will and it's hard to carry it out if your hard drives are vaporized
12:14 ^🔗		odemg has quit IRC (Read error: Operation timed out)
13:01 ^🔗		odemg has joined #archiveteam-bs
13:08 ^🔗		BlueMaxim has quit IRC (Read error: Connection reset by peer)
14:04 ^🔗	Uzerus	ewwww, making first program in life (aka command-line program in Python) and i don't know how to connect 2 things
14:05 ^🔗	Uzerus	parser (option parser) and that what do the thing, how to tell parser in what options should do what
14:05 ^🔗	Uzerus	its argparse module
14:08 ^🔗		marvinw is now known as ivan
14:08 ^🔗	ivan	Uzerus: generally you call a function to parse the options to an object and then look inside the object to do whatever you want
14:09 ^🔗	ivan	surely the thing you used came with an example of that?
14:09 ^🔗	Uzerus	it's my 1st application, i never ever seen any module in my life, builded something like... wait ill pastebin
14:10 ^🔗	Uzerus	https://pastebin.com/vV8DF9Cn
14:10 ^🔗	Uzerus	so i see the namespace(,,,,,,,,,)
14:12 ^🔗	Uzerus	it should take a file (log.gz), read it (#import gzip), and check every domain it finds: if it exists in ignores, trash that domain; if it's not in ignores and not in the "done_scan" file, output it as a line
14:13 ^🔗	Uzerus	1 domain 1 line...
14:14 ^🔗	Uzerus	so many many things i must learn, hope i ll do it faster or later, but maybe you know any python script that i can use as template?
14:14 ^🔗	ivan	Uzerus: https://gist.github.com/ivan/9c49d29f231e5a119ee64d5feb54be10
14:15 ^🔗	ivan	(an example of processing arguments, not an implementation of what you want)
14:16 ^🔗	ivan	can you link me to an example log file?
14:16 ^🔗	Uzerus	its every ArchiveBot meta.gz file
14:17 ^🔗	Uzerus	https://archive.org/download/archiveteam_archivebot_go_20180104070001/urls-pastebin.com-DiLA7Av8-inf-20180102-215402-bxeom-meta.warc.gz
14:17 ^🔗	Uzerus	i can do it by zgrep -Po '((?<=http://)|(?<=https://))[^/]+(?=/)' *-meta.warc.gz | tail -n+2 | awk '!seen[$0]++'
14:18 ^🔗	Uzerus	thanks to JAA, but ill learn a little (and create something more portable)
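
Editor's note: a rough Python sketch of the zgrep pipeline quoted above, combined with the ignores / "done_scan" filtering Uzerus describes. The function names and the `ignores` and `done` parameters are hypothetical, chosen here for illustration. The regex is a simplified stand-in for the original pattern (it also stops at whitespace), and the `tail -n+2` step that drops the first match is not reproduced.

```python
import gzip
import re

# Host part between "http(s)://" and the next "/", as in the zgrep pattern.
# Assumption: whitespace is excluded too, so a match cannot run across fields.
DOMAIN_RE = re.compile(rb"https?://([^/\s]+)/")


def extract_domains(lines, ignores=frozenset(), done=frozenset()):
    """Yield each domain once, in first-seen order (like awk '!seen[$0]++'),
    skipping anything listed in `ignores` or `done`."""
    seen = set()
    for line in lines:
        for match in DOMAIN_RE.finditer(line):
            domain = match.group(1).decode("ascii", "replace")
            if domain in seen or domain in ignores or domain in done:
                continue
            seen.add(domain)
            yield domain


def domains_from_gzip(path, ignores=frozenset(), done=frozenset()):
    """Read a gzip-compressed log as bytes, line by line, and yield new domains."""
    with gzip.open(path, "rb") as handle:
        yield from extract_domains(handle, ignores, done)
```

Printing one domain per line would then be `for d in domains_from_gzip("log.gz"): print(d)`, where `log.gz` is a placeholder path.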
14:31 ^🔗	Uzerus	ok, so i have next problems, how exactly classes works
14:32 ^🔗	Uzerus	im trying to not look into some1 else code, write from scratch for study things
14:33 ^🔗	Uzerus	ill ask when i will stuck :)
14:46 ^🔗	Uzerus	ivan: that what i write will take the file specified? i mean i can use ARGS.logfile class?
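
Editor's note: a minimal sketch of what ivan describes at 14:08, assuming a parser with one positional `logfile` argument and two optional file arguments. The option names are hypothetical and are not taken from the pastebin or the gist. `parse_args` returns an `argparse.Namespace` object, and `args.logfile` is an ordinary attribute holding a string, not a class.

```python
import argparse


def build_parser():
    """Declare the options; nothing is parsed yet."""
    parser = argparse.ArgumentParser(
        description="List new domains found in a log file.")
    parser.add_argument("logfile", help="path to a gzip-compressed log")
    parser.add_argument("--ignores", default=None,
                        help="file with domains to skip, one per line")
    parser.add_argument("--done", default=None,
                        help="file with domains already scanned")
    return parser


def main(argv=None):
    # Parse the options into an object (argv=None means: use sys.argv[1:]).
    args = build_parser().parse_args(argv)
    # Then look inside the object and decide what to do.
    print("would read:", args.logfile)
    if args.ignores is not None:
        print("would skip domains listed in:", args.ignores)
    if args.done is not None:
        print("would skip domains already in:", args.done)
    return args
```

Calling `main(["log.gz", "--ignores", "ignores.txt"])` exercises the parser without touching the real command line.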
14:47 ^🔗		odemg has quit IRC (Ping timeout: 246 seconds)
15:03 ^🔗		odemg has joined #archiveteam-bs
15:20 ^🔗		odemg has quit IRC (Ping timeout: 480 seconds)
15:20 ^🔗		Jusque has quit IRC (Ping timeout: 250 seconds)
15:21 ^🔗		Jusque has joined #archiveteam-bs
15:30 ^🔗		odemg has joined #archiveteam-bs
16:08 ^🔗		username1 has quit IRC (Quit: Leaving)
16:17 ^🔗		Dimtree has quit IRC (Peace)
17:02 ^🔗		icedice has joined #archiveteam-bs
17:32 ^🔗		icedice has quit IRC (Ping timeout: 245 seconds)
17:38 ^🔗		dashcloud has joined #archiveteam-bs
17:43 ^🔗		Mateon1 has quit IRC (Ping timeout: 260 seconds)
17:44 ^🔗		Mateon1 has joined #archiveteam-bs
18:40 ^🔗		atrocity has quit IRC (Ping timeout: 246 seconds)
20:14 ^🔗		WubTheCap has joined #archiveteam-bs
20:19 ^🔗	WubTheCap	Crawling phpBB3 sucks as logged in user (open access boards with login required). Ended up with this script and regex filter list: https://paste.debian.net/plainh/0095af75
20:19 ^🔗	WubTheCap	I ended up wasting 3.5 hours of crawling time to find out some of the regex filters because they were not listed on ArchiveTeam wiki yet
20:19 ^🔗	WubTheCap	That 3.5 hours turned out to be 50% of the pages
20:20 ^🔗	WubTheCap	Overall the first crawl took 9.5 hours, waiting for this second crawl to complete
20:20 ^🔗	WubTheCap	Sorry, this was probably meant to be #archiveteam-ot
20:27 ^🔗	WubTheCap	Even so, a few pages have &sid= in the URI, but I also find the identical page without &sid=
20:52 ^🔗	JAA	Yeah, session IDs are really annoying.
20:53 ^🔗	JAA	And check out the ArchiveBot ignore sets. At least some of those things are included in the 'forums' igset.
20:54 ^🔗	JAA	(That includes all sorts of forum softwares though, so it's a bit messy.)
20:54 ^🔗	JAA	And this is the right channel for that stuff. You are archiving the forum after all.
21:00 ^🔗	WubTheCap	Oddly enough my crawler never hit report.php, although it was not in the list
21:00 ^🔗	WubTheCap	And yeah, thanks
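
Editor's note: one possible way to collapse the duplicate pages mentioned above is to drop the `sid` query parameter before comparing URLs. This is a sketch using only the standard library, not the script from the paste and not an ArchiveBot ignore pattern. The function names are hypothetical, and `urlencode` may re-encode some characters in the remaining query string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url):
    """Return the URL with any `sid` query parameter removed."""
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "sid"]
    return urlunsplit(parts._replace(query=urlencode(query)))


def dedupe_urls(urls):
    """Yield each page once, in first-seen order, ignoring session IDs."""
    seen = set()
    for url in urls:
        key = strip_sid(url)
        if key not in seen:
            seen.add(key)
            yield key
```

With this, `viewtopic.php?f=2&t=10&sid=...` and `viewtopic.php?f=2&t=10` are treated as the same page.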
21:44 ^🔗	godane	so i got 10 more vhs tapes from savers
21:45 ^🔗	godane	i got a series of tapes called itty bitty kiddy wildlife
22:35 ^🔗		Ravenloft has joined #archiveteam-bs
23:02 ^🔗	godane	so i'm at about 19k items this month
23:35 ^🔗		TC01 has quit IRC (Read error: Operation timed out)
23:35 ^🔗		BlueMaxim has joined #archiveteam-bs
23:39 ^🔗		TC01 has joined #archiveteam-bs