#archiveteam-bs 2018-01-14,Sun

Time Nickname Message
00:26 🔗 BlueMaxim has joined #archiveteam-bs
01:16 🔗 godane has quit IRC (Read error: Operation timed out)
01:19 🔗 godane has joined #archiveteam-bs
02:22 🔗 username1 has joined #archiveteam-bs
02:32 🔗 schbirid2 has quit IRC (Read error: Operation timed out)
03:56 🔗 Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
04:32 🔗 Odd0002 has joined #archiveteam-bs
04:43 🔗 qw3rty17 has joined #archiveteam-bs
04:48 🔗 qw3rty16 has quit IRC (Read error: Operation timed out)
05:02 🔗 octothorp has quit IRC (Read error: Connection reset by peer)
05:03 🔗 octothorp has joined #archiveteam-bs
08:03 🔗 phirephly has joined #archiveteam-bs
10:42 🔗 tuluu has quit IRC (Ping timeout: 260 seconds)
10:53 🔗 jacketcha Does anybody know how to archive DMs on instagram and twitter, or password protected pages?
11:59 🔗 username1 for what purpose?
12:12 🔗 jacketcha Well, if I'm going to get nuked, no point in not dumping every ounce of data you own
12:13 🔗 jacketcha Also because it's in my will and it's hard to carry it out if your hard drives are vaporized
12:14 🔗 odemg has quit IRC (Read error: Operation timed out)
13:01 🔗 odemg has joined #archiveteam-bs
13:08 🔗 BlueMaxim has quit IRC (Read error: Connection reset by peer)
14:04 🔗 Uzerus ewwww, making my first program ever (a command-line program in Python) and i don't know how to connect 2 things
14:05 🔗 Uzerus the parser (option parser) and the code that does the actual work; how do i tell the parser which option should do what?
14:05 🔗 Uzerus it's the argparse module
14:08 🔗 marvinw is now known as ivan
14:08 🔗 ivan Uzerus: generally you call a function to parse the options to an object and then look inside the object to do whatever you want
14:09 🔗 ivan surely the thing you used came with an example of that?
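[editor's note] The pattern ivan describes (parse the options into an object, then read values off that object) might look like the sketch below with argparse. The option names `--ignores` and `--done` and both helper functions are hypothetical, chosen only for illustration; they are not taken from the pastebin or the gist.

```python
import argparse


def build_parser():
    # Hypothetical options for illustration; adjust to the real tool.
    parser = argparse.ArgumentParser(description="Extract domains from a log file")
    parser.add_argument("logfile", help="path to the gzipped log")
    parser.add_argument("--ignores", default=None, help="file of domains to skip")
    parser.add_argument("--done", default=None, help="file of domains already scanned")
    return parser


def describe(args):
    # args is an argparse.Namespace; every option is a plain attribute on it.
    return "reading {} (ignores={}, done={})".format(args.logfile, args.ignores, args.done)


# Passing an explicit list instead of reading sys.argv keeps the example repeatable.
args = build_parser().parse_args(["log.gz", "--ignores", "ignores.txt"])
print(describe(args))
```

In a real script, `parse_args()` with no arguments reads the command line, and the rest of the program branches on the attributes of the returned object.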
14:09 🔗 Uzerus it's my 1st application, i've never used any module in my life, i built something like... wait, i'll pastebin it
14:10 🔗 Uzerus https://pastebin.com/vV8DF9Cn
14:10 🔗 Uzerus so i see the Namespace(,,,,,,,,,)
14:12 🔗 Uzerus it should take a file (log.gz), read it (import gzip), and check every domain it finds: if it's in the ignores, discard that domain; if it's in neither the ignores nor the "done_scan" file, write it out as a line
14:13 🔗 Uzerus 1 domain per line...
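[editor's note] The filtering step described above could be sketched roughly as follows. The function names are hypothetical, and the sketch assumes that both the ignores and the "done_scan" file hold one domain per line and that matching should be case-insensitive; neither assumption is confirmed in the log.

```python
def load_domains(path):
    """Read one domain per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def new_domains(found, ignores, done):
    """Return domains from `found` that are in neither set, once each, in first-seen order."""
    seen = set()
    result = []
    for domain in found:
        domain = domain.lower()
        if domain in ignores or domain in done or domain in seen:
            continue
        seen.add(domain)
        result.append(domain)
    return result


# Example: b.com is ignored, c.com was already scanned, a.com repeats.
print("\n".join(new_domains(["a.com", "B.com", "a.com", "c.com"], {"b.com"}, {"c.com"})))
```

Sets make each membership check cheap, which matters once the ignore and done lists grow long.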
14:14 🔗 Uzerus so many things i must learn, hope i'll get there sooner or later, but maybe you know a python script that i can use as a template?
14:14 🔗 ivan Uzerus: https://gist.github.com/ivan/9c49d29f231e5a119ee64d5feb54be10
14:15 🔗 ivan (an example of processing arguments, not an implementation of what you want)
14:16 🔗 ivan can you link me to an example log file?
14:16 🔗 Uzerus it's any ArchiveBot meta.warc.gz file
14:17 🔗 Uzerus https://archive.org/download/archiveteam_archivebot_go_20180104070001/urls-pastebin.com-DiLA7Av8-inf-20180102-215402-bxeom-meta.warc.gz
14:17 🔗 Uzerus i can do it by zgrep -Po '((?<=http://)|(?<=https://))[^/]+(?=/)' *-meta.warc.gz | tail -n+2 | awk '!seen[$0]++'
14:18 🔗 Uzerus thanks to JAA, but i'll learn a little (and create something more portable)
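[editor's note] A rough Python counterpart to the zgrep pipeline above, as a sketch only. It uses a capture group instead of the lookbehind/lookahead pair, deduplicates in first-seen order like the awk step, and leaves out the `tail -n+2` step that drops the first match. The function names are hypothetical.

```python
import gzip
import re

# Captures the host of any http:// or https:// URL that is followed by a slash.
HOST_RE = re.compile(rb"https?://([^/]+)/")


def iter_hosts(lines):
    """Yield each distinct host once, in first-seen order."""
    seen = set()
    for line in lines:
        for host in HOST_RE.findall(line):
            if host not in seen:
                seen.add(host)
                yield host.decode("utf-8", errors="replace")


def hosts_in_gzip(path):
    """Stream a gzip file line by line and return its distinct hosts."""
    with gzip.open(path, "rb") as handle:
        return list(iter_hosts(handle))


sample = [b"GET http://example.com/a\n", b"https://pastebin.com/x http://example.com/b\n"]
print(list(iter_hosts(sample)))
```

Reading in binary mode avoids decoding errors on arbitrary log content; only the extracted hosts are decoded.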
14:31 🔗 Uzerus ok, so my next problem: how exactly do classes work?
14:32 🔗 Uzerus i'm trying not to look at someone else's code, and to write from scratch to learn things
14:33 🔗 Uzerus i'll ask when i get stuck :)
14:46 🔗 Uzerus ivan: will what i wrote take the specified file? i mean, can i use the ARGS.logfile class?
14:47 🔗 odemg has quit IRC (Ping timeout: 246 seconds)
15:03 🔗 odemg has joined #archiveteam-bs
15:20 🔗 odemg has quit IRC (Ping timeout: 480 seconds)
15:20 🔗 Jusque has quit IRC (Ping timeout: 250 seconds)
15:21 🔗 Jusque has joined #archiveteam-bs
15:30 🔗 odemg has joined #archiveteam-bs
16:08 🔗 username1 has quit IRC (Quit: Leaving)
16:17 🔗 Dimtree has quit IRC (Peace)
17:02 🔗 icedice has joined #archiveteam-bs
17:32 🔗 icedice has quit IRC (Ping timeout: 245 seconds)
17:38 🔗 dashcloud has joined #archiveteam-bs
17:43 🔗 Mateon1 has quit IRC (Ping timeout: 260 seconds)
17:44 🔗 Mateon1 has joined #archiveteam-bs
18:40 🔗 atrocity has quit IRC (Ping timeout: 246 seconds)
20:14 🔗 WubTheCap has joined #archiveteam-bs
20:19 🔗 WubTheCap Crawling phpBB3 as a logged-in user sucks (open-access boards with login required). Ended up with this script and regex filter list: https://paste.debian.net/plainh/0095af75
20:19 🔗 WubTheCap I ended up wasting 3.5 hours of crawling time figuring out some of the regex filters because they were not listed on the ArchiveTeam wiki yet
20:19 🔗 WubTheCap That 3.5 hours turned out to be 50% of the pages
20:20 🔗 WubTheCap Overall the first crawl took 9.5 hours, waiting for this second crawl to complete
20:20 🔗 WubTheCap Sorry, this was probably meant to be #archiveteam-ot
20:27 🔗 WubTheCap Even so, a few pages have &sid= in the URI, and the crawler also finds the identical page without &sid=
20:52 🔗 JAA Yeah, session IDs are really annoying.
20:53 🔗 JAA And check out the ArchiveBot ignore sets. At least some of those things are included in the 'forums' igset.
20:54 🔗 JAA (That includes all sorts of forum software though, so it's a bit messy.)
20:54 🔗 JAA And this is the right channel for that stuff. You *are* archiving the forum after all.
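[editor's note] One way to spot the duplicate pages caused by session IDs is to normalize URLs by dropping the `sid` query parameter before comparing them. This is a minimal sketch using the standard library; it is not the ArchiveBot ignore set and not the filter list from the paste, and the function name and example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url, param="sid"):
    """Return the URL with every `param` query argument removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_sid("http://forum.example/viewtopic.php?f=2&t=10&sid=abc123"))
```

Note that `urlencode` re-encodes the remaining parameters, so the result is suited to deduplication keys rather than to byte-exact URL reproduction.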
21:00 🔗 WubTheCap Oddly enough my crawler never hit report.php, even though it was not in the list
21:00 🔗 WubTheCap And yeah, thanks
21:44 🔗 godane so i got 10 more vhs tapes from savers
21:45 🔗 godane i got a series of tapes called itty bitty kiddy wildlife
22:35 🔗 Ravenloft has joined #archiveteam-bs
23:02 🔗 godane so i'm at about 19k items this month
23:35 🔗 TC01 has quit IRC (Read error: Operation timed out)
23:35 🔗 BlueMaxim has joined #archiveteam-bs
23:39 🔗 TC01 has joined #archiveteam-bs
