#archiveteam-bs 2018-01-14,Sun


Who What When
***BlueMaxim has joined #archiveteam-bs [00:26]
........... (idle for 50mn)
godane has quit IRC (Read error: Operation timed out)
godane has joined #archiveteam-bs
[01:16]
............. (idle for 1h3mn)
username1 has joined #archiveteam-bs [02:22]
schbirid2 has quit IRC (Read error: Operation timed out) [02:32]
................. (idle for 1h24mn)
Odd0002 has quit IRC (Quit: ZNC - http://znc.in) [03:56]
........ (idle for 36mn)
Odd0002 has joined #archiveteam-bs [04:32]
qw3rty17 has joined #archiveteam-bs [04:43]
qw3rty16 has quit IRC (Read error: Operation timed out) [04:48]
octothorp has quit IRC (Read error: Connection reset by peer)
octothorp has joined #archiveteam-bs
[05:02]
..................................... (idle for 3h0mn)
phirephly has joined #archiveteam-bs [08:03]
................................ (idle for 2h39mn)
tuluu has quit IRC (Ping timeout: 260 seconds) [10:42]
<jacketcha> Does anybody know how to archive DMs on instagram and twitter, or password protected pages? [10:53]
.............. (idle for 1h6mn)
<username1> for what purpose? [11:59]
<jacketcha> Well, if I'm going to get nuked, no point in not dumping every ounce of data you own
Also because it's in my will and it's hard to carry it out if your hard drives are vaporized
[12:12]
***odemg has quit IRC (Read error: Operation timed out) [12:14]
.......... (idle for 47mn)
odemg has joined #archiveteam-bs [13:01]
BlueMaxim has quit IRC (Read error: Connection reset by peer) [13:08]
............ (idle for 56mn)
<Uzerus> ewwww, i'm making my first program ever (a command-line program in Python) and i don't know how to connect 2 things:
the parser (option parser) and the code that does the actual work. how do i tell the parser which option should do what?
it's the argparse module
[14:04]
***marvinw is now known as ivan [14:08]
<ivan> Uzerus: generally you call a function to parse the options to an object and then look inside the object to do whatever you want
surely the thing you used came with an example of that?
[14:08]
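As a rough illustration of the pattern ivan describes (parse the options into an object, then look inside it), here is a minimal sketch. The option names `logfile`, `--ignores` and `--verbose` are hypothetical and are not taken from the pastebin or the gist linked in this log.

```python
import argparse


def build_parser():
    # Hypothetical options, chosen only to illustrate the pattern.
    parser = argparse.ArgumentParser(description="Extract domains from a log file")
    parser.add_argument("logfile", help="path to a gzip-compressed log")
    parser.add_argument("--ignores", default=None, help="file listing domains to skip")
    parser.add_argument("-v", "--verbose", action="store_true", help="print progress")
    return parser


def main(argv=None):
    # parse_args returns an argparse.Namespace; every option becomes a
    # plain attribute on it (args.logfile is a string, not a class).
    args = build_parser().parse_args(argv)
    if args.verbose:
        print("reading", args.logfile)
    # The program's real work would start here, driven by the attributes.
    return args.logfile, args.ignores
```

Calling `main(["some-log.gz", "--verbose"])` parses that explicit list; calling `main()` with no argument makes argparse read the real command line instead.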
<Uzerus> it's my 1st application, i've never seen any module in my life. i built something like... wait, i'll pastebin it
https://pastebin.com/vV8DF9Cn
so i see the namespace(,,,,,,,,,)
it should take a file (log.gz), read it (#import gzip), and check every domain it finds: if the domain is in ignores, trash it; if it's not in ignores and not in the "done_scan" file, write it out as a line
1 domain per line...
so many, many things i must learn, hope i'll manage it sooner or later, but maybe you know a python script that i can use as a template?
[14:09]
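The filtering step described above (drop domains that are in the ignores, drop domains already in "done_scan", emit the rest one per line) could be sketched roughly as follows. The function names and the one-domain-per-line file layout are assumptions made for illustration, not details confirmed in the log.

```python
def load_domains(path):
    """Read one domain per line into a set, skipping blank lines.

    Assumes a plain-text file with one domain per line (hypothetical layout).
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_domains(found, ignores, done):
    """Yield each domain that is new, once, in first-seen order."""
    seen = set()
    for domain in found:
        domain = domain.strip().lower()
        if not domain or domain in seen:
            continue
        if domain in ignores or domain in done:
            continue  # trashed: ignored or already scanned
        seen.add(domain)
        yield domain
```

Printing each yielded value gives the "1 domain per line" output; sets keep the membership checks cheap even for long ignore lists.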
<ivan> Uzerus: https://gist.github.com/ivan/9c49d29f231e5a119ee64d5feb54be10
(an example of processing arguments, not an implementation of what you want)
can you link me to an example log file?
[14:14]
<Uzerus> it's every ArchiveBot meta.gz file
https://archive.org/download/archiveteam_archivebot_go_20180104070001/urls-pastebin.com-DiLA7Av8-inf-20180102-215402-bxeom-meta.warc.gz
i can do it with zgrep -Po '((?<=http://)|(?<=https://))[^/]+(?=/)' *-meta.warc.gz | tail -n+2 | awk '!seen[$0]++'
thanks to JAA, but i'll learn a little (and create something more portable)
[14:16]
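For comparison, a rough Python counterpart to the zgrep/awk pipeline above might look like the sketch below. It assumes the file can be read line by line as bytes, it does not reproduce the `tail -n+2` step, and the function names are made up for this example.

```python
import gzip
import re

# Same idea as the grep pattern: the host part after http:// or https://,
# up to (but not including) the next slash.
HOST_RE = re.compile(rb"https?://([^/]+)(?=/)")


def iter_hosts(lines):
    """Yield each host once, in first-seen order (like awk '!seen[$0]++')."""
    seen = set()
    for line in lines:
        for match in HOST_RE.finditer(line):
            host = match.group(1).decode("ascii", errors="replace")
            if host not in seen:
                seen.add(host)
                yield host


def hosts_in_gzip(path):
    """Stream a gzip-compressed file and yield the unique hosts found in it."""
    with gzip.open(path, "rb") as handle:
        yield from iter_hosts(handle)
```

Working on bytes avoids decoding errors on binary log content, and streaming the file keeps memory use low on large logs.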
ok, so my next problem: how exactly do classes work
i'm trying not to look at someone else's code, and to write from scratch to learn things
i'll ask when i get stuck :)
[14:31]
ivan: will what i wrote take the file specified? i mean, can i use the ARGS.logfile class? [14:46]
***odemg has quit IRC (Ping timeout: 246 seconds) [14:47]
.... (idle for 16mn)
odemg has joined #archiveteam-bs [15:03]
.... (idle for 17mn)
odemg has quit IRC (Ping timeout: 480 seconds)
Jusque has quit IRC (Ping timeout: 250 seconds)
Jusque has joined #archiveteam-bs
[15:20]
odemg has joined #archiveteam-bs [15:30]
........ (idle for 38mn)
username1 has quit IRC (Quit: Leaving) [16:08]
Dimtree has quit IRC (Peace) [16:17]
.......... (idle for 45mn)
icedice has joined #archiveteam-bs [17:02]
....... (idle for 30mn)
icedice has quit IRC (Ping timeout: 245 seconds) [17:32]
dashcloud has joined #archiveteam-bs [17:38]
Mateon1 has quit IRC (Ping timeout: 260 seconds)
Mateon1 has joined #archiveteam-bs
[17:43]
............ (idle for 56mn)
atrocity has quit IRC (Ping timeout: 246 seconds) [18:40]
................... (idle for 1h34mn)
WubTheCap has joined #archiveteam-bs [20:14]
<WubTheCap> Crawling phpBB3 as a logged-in user sucks (open access boards with login required). Ended up with this script and regex filter list: https://paste.debian.net/plainh/0095af75
I ended up wasting 3.5 hours of crawling time to find out some of the regex filters because they were not listed on the ArchiveTeam wiki yet
That 3.5 hours turned out to be 50% of the pages
Overall the first crawl took 9.5 hours, waiting for this second crawl to complete
Sorry, this was probably meant to be #archiveteam-ot
[20:19]
Even so, a few pages have &sid= in the URI, and the identical page also turns up without &sid= [20:27]
...... (idle for 25mn)
<JAA> Yeah, session IDs are really annoying.
And check out the ArchiveBot ignore sets. At least some of those things are included in the 'forums' igset.
(That includes all sorts of forum software though, so it's a bit messy.)
And this is the right channel for that stuff. You *are* archiving the forum after all.
[20:52]
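One way to spot the duplicate pages mentioned above is to normalize URLs by dropping the session parameter before comparing them. This is only a sketch: the parameter name `sid` comes from the discussion, while the function name is hypothetical, and it is not how ArchiveBot's ignore sets work (those are regex-based).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="sid"):
    """Return url with every occurrence of the given query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    # urlencode re-encodes the remaining pairs, so unusual escaping in the
    # original query string may come out in a slightly different form.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two URLs that differ only in `&sid=` then map to the same string, which makes them easy to deduplicate in a set.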
<WubTheCap> Oddly enough my crawler never hit report.php, although it was not in the list
And yeah, thanks
[21:00]
......... (idle for 44mn)
<godane> so i got 10 more vhs tapes from savers
i got a series of tapes called itty bitty kiddy wildlife
[21:44]
........... (idle for 50mn)
***Ravenloft has joined #archiveteam-bs [22:35]
...... (idle for 27mn)
<godane> so i'm at about 19k items this month [23:02]
....... (idle for 33mn)
***TC01 has quit IRC (Read error: Operation timed out)
BlueMaxim has joined #archiveteam-bs
TC01 has joined #archiveteam-bs
[23:35]
