[00:26] *** BlueMaxim has joined #archiveteam-bs
[01:16] *** godane has quit IRC (Read error: Operation timed out)
[01:19] *** godane has joined #archiveteam-bs
[02:22] *** username1 has joined #archiveteam-bs
[02:32] *** schbirid2 has quit IRC (Read error: Operation timed out)
[03:56] *** Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
[04:32] *** Odd0002 has joined #archiveteam-bs
[04:43] *** qw3rty17 has joined #archiveteam-bs
[04:48] *** qw3rty16 has quit IRC (Read error: Operation timed out)
[05:02] *** octothorp has quit IRC (Read error: Connection reset by peer)
[05:03] *** octothorp has joined #archiveteam-bs
[08:03] *** phirephly has joined #archiveteam-bs
[10:42] *** tuluu has quit IRC (Ping timeout: 260 seconds)
[10:53] Does anybody know how to archive DMs on Instagram and Twitter, or password-protected pages?
[11:59] For what purpose?
[12:12] Well, if I'm going to get nuked, there's no point in not dumping every ounce of data you own
[12:13] Also because it's in my will, and it's hard to carry it out if your hard drives are vaporized
[12:14] *** odemg has quit IRC (Read error: Operation timed out)
[13:01] *** odemg has joined #archiveteam-bs
[13:08] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[14:04] ewwww, making my first program ever (a command-line program in Python) and I don't know how to connect the two parts
[14:05] the parser (option parser) and the part that does the work; how do I tell the parser which options should do what?
[14:05] it's the argparse module
[14:08] *** marvinw is now known as ivan
[14:08] Uzerus: generally you call a function to parse the options into an object and then look inside the object to do whatever you want
[14:09] surely the thing you used came with an example of that?
[14:09] it's my first application, I've never seen any module in my life; I built something like... wait, I'll pastebin it
[14:10] https://pastebin.com/vV8DF9Cn
[14:10] so I see the Namespace(,,,,,,,,,)
[14:12] it should take a file (log.gz), read it (#import gzip), and check every domain it finds: if it's in the ignores, trash that domain; if it's not in the ignores and not in the "done_scan" file, include it as a line
[14:13] 1 domain, 1 line...
[14:14] so many things I have to learn; I hope I'll get there sooner or later, but maybe you know of a Python script I can use as a template?
[14:14] Uzerus: https://gist.github.com/ivan/9c49d29f231e5a119ee64d5feb54be10
[14:15] (an example of processing arguments, not an implementation of what you want)
[14:16] can you link me to an example log file?
[14:16] it's every ArchiveBot meta.gz file
[14:17] https://archive.org/download/archiveteam_archivebot_go_20180104070001/urls-pastebin.com-DiLA7Av8-inf-20180102-215402-bxeom-meta.warc.gz
[14:17] I can do it with: zgrep -Po '((?<=http://)|(?<=https://))[^/]+(?=/)' *-meta.warc.gz | tail -n+2 | awk '!seen[$0]++'
[14:18] thanks to JAA, but I'll learn a little (and create something more portable)
[14:31] ok, so my next problem is how exactly classes work
[14:32] I'm trying not to look at someone else's code, and to write it from scratch to learn
[14:33] I'll ask when I get stuck :)
[14:46] ivan: will what I wrote take the file specified? I mean, can I just use ARGS.logfile?
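
To illustrate ivan's point (argparse parses the options into a Namespace object, and you then read attributes such as args.logfile off that object) together with the flow Uzerus describes (read a gzipped meta log, pull out domains, skip anything in the ignores or done_scan files, print one domain per line), here is a minimal sketch. The option names, file formats, and the hostname regex (modelled loosely on JAA's zgrep pattern) are assumptions for illustration, not Uzerus's actual code.

    # Sketch only: assumed option names, file formats, and regex.
    import argparse
    import gzip
    import re

    HOST_RE = re.compile(r'https?://([^/\s]+)/')  # roughly JAA's zgrep pattern

    def load_set(path):
        """Read one entry per line into a set; a missing file yields an empty set."""
        try:
            with open(path) as f:
                return {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return set()

    def main():
        parser = argparse.ArgumentParser(description='Extract new domains from a meta.warc.gz log')
        parser.add_argument('logfile', help='path to the *-meta.warc.gz file')
        parser.add_argument('--ignores', default='ignores', help='file with domains to skip')
        parser.add_argument('--done', default='done_scan', help='file with domains already scanned')
        args = parser.parse_args()      # returns a Namespace; access values as args.logfile etc.

        ignores = load_set(args.ignores)
        done = load_set(args.done)
        seen = set()

        with gzip.open(args.logfile, 'rt', errors='replace') as log:
            for line in log:
                for host in HOST_RE.findall(line):
                    if host in ignores or host in done or host in seen:
                        continue
                    seen.add(host)
                    print(host)         # one domain per line

    if __name__ == '__main__':
        main()

Run as, for example, python extract_domains.py urls-...-meta.warc.gz --ignores ignores --done done_scan (the script name is made up here).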
[14:47] *** odemg has quit IRC (Ping timeout: 246 seconds)
[15:03] *** odemg has joined #archiveteam-bs
[15:20] *** odemg has quit IRC (Ping timeout: 480 seconds)
[15:20] *** Jusque has quit IRC (Ping timeout: 250 seconds)
[15:21] *** Jusque has joined #archiveteam-bs
[15:30] *** odemg has joined #archiveteam-bs
[16:08] *** username1 has quit IRC (Quit: Leaving)
[16:17] *** Dimtree has quit IRC (Peace)
[17:02] *** icedice has joined #archiveteam-bs
[17:32] *** icedice has quit IRC (Ping timeout: 245 seconds)
[17:38] *** dashcloud has joined #archiveteam-bs
[17:43] *** Mateon1 has quit IRC (Ping timeout: 260 seconds)
[17:44] *** Mateon1 has joined #archiveteam-bs
[18:40] *** atrocity has quit IRC (Ping timeout: 246 seconds)
[20:14] *** WubTheCap has joined #archiveteam-bs
[20:19] Crawling phpBB3 as a logged-in user sucks (open-access boards that still require a login). I ended up with this script and regex filter list: https://paste.debian.net/plainh/0095af75
[20:19] I wasted 3.5 hours of crawling time working out some of the regex filters because they weren't listed on the ArchiveTeam wiki yet
[20:19] Those 3.5 hours turned out to be 50% of the pages
[20:20] Overall the first crawl took 9.5 hours; waiting for this second crawl to complete
[20:20] Sorry, this was probably meant for #archiveteam-ot
[20:27] Even so, a few pages have &sid= in the URI even though the identical page also turns up without &sid=
[20:52] Yeah, session IDs are really annoying.
[20:53] And check out the ArchiveBot ignore sets. At least some of those things are included in the 'forums' igset.
[20:54] (That includes all sorts of forum software though, so it's a bit messy.)
[20:54] And this is the right channel for that stuff. You *are* archiving the forum after all.
[21:00] Oddly enough my crawler never hit report.php, although it was not in the list
[21:00] And yeah, thanks
[21:44] so I got 10 more VHS tapes from Savers
[21:45] I got a series of tapes called Itty Bitty Kiddy Wildlife
[22:35] *** Ravenloft has joined #archiveteam-bs
[23:02] so I'm at about 19k items this month
[23:35] *** TC01 has quit IRC (Read error: Operation timed out)
[23:35] *** BlueMaxim has joined #archiveteam-bs
[23:39] *** TC01 has joined #archiveteam-bs
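
On the &sid= duplicates discussed above: phpBB appends a session ID as a sid query parameter, so the crawler can fetch the same page once with and once without it. One way to collapse such duplicates is to strip the sid parameter before deduplicating crawled URLs. This is a minimal sketch of that idea only; it is not WubTheCap's script or the ArchiveBot 'forums' igset, and the example hostname is made up.

    # Sketch: normalise phpBB URLs by dropping the sid query parameter.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def strip_sid(url):
        """Return the URL with any phpBB-style sid query parameter removed."""
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k.lower() != 'sid']
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(query), parts.fragment))

    # Example (hypothetical URL):
    # strip_sid('http://forum.example/viewtopic.php?f=1&t=2&sid=deadbeef')
    # -> 'http://forum.example/viewtopic.php?f=1&t=2'

Deduplicating on the stripped form treats viewtopic.php?f=1&t=2&sid=... and viewtopic.php?f=1&t=2 as the same page, which is the behaviour the session-ID ignore patterns are approximating.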