Time |
Nickname |
Message |
00:26
|
|
BlueMaxim has joined #archiveteam-bs |
01:16
|
|
godane has quit IRC (Read error: Operation timed out) |
01:19
|
|
godane has joined #archiveteam-bs |
02:22
|
|
username1 has joined #archiveteam-bs |
02:32
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
03:56
|
|
Odd0002 has quit IRC (Quit: ZNC - http://znc.in) |
04:32
|
|
Odd0002 has joined #archiveteam-bs |
04:43
|
|
qw3rty17 has joined #archiveteam-bs |
04:48
|
|
qw3rty16 has quit IRC (Read error: Operation timed out) |
05:02
|
|
octothorp has quit IRC (Read error: Connection reset by peer) |
05:03
|
|
octothorp has joined #archiveteam-bs |
08:03
|
|
phirephly has joined #archiveteam-bs |
10:42
|
|
tuluu has quit IRC (Ping timeout: 260 seconds) |
10:53
|
jacketcha |
Does anybody know how to archive DMs on instagram and twitter, or password protected pages? |
11:59
|
username1 |
for what purpose? |
12:12
|
jacketcha |
Well, if I'm going to get nuked, no point in not dumping every ounce of data you own |
12:13
|
jacketcha |
Also because it's in my will and it's hard to carry it out if your hard drives are vaporized |
12:14
|
|
odemg has quit IRC (Read error: Operation timed out) |
13:01
|
|
odemg has joined #archiveteam-bs |
13:08
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
14:04
|
Uzerus |
ewwww, making my first program in life (aka a command-line program in Python) and i don't know how to connect 2 things |
14:05
|
Uzerus |
the parser (option parser) and the code that does the actual work; how do i tell the parser which option should do what |
14:05
|
Uzerus |
its argparse module |
14:08
|
|
marvinw is now known as ivan |
14:08
|
ivan |
Uzerus: generally you call a function to parse the options to an object and then look inside the object to do whatever you want |
14:09
|
ivan |
surely the thing you used came with an example of that? |
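A minimal sketch of the pattern ivan describes, assuming Python's standard `argparse` module: `parse_args()` turns the command line into a `Namespace` object, and the rest of the program reads attributes off that object to decide what to do. The option names (`logfile`, `--ignores`, `--verbose`) and the `run` function are hypothetical and are not taken from Uzerus's paste.

```python
import argparse


def build_parser():
    """Declare the options; nothing is executed here."""
    parser = argparse.ArgumentParser(description="Extract domains from a log file")
    parser.add_argument("logfile", help="path to the gzip-compressed log")
    parser.add_argument("--ignores", default=None, help="file listing domains to skip")
    parser.add_argument("--verbose", action="store_true", help="print progress")
    return parser


def run(args):
    """Decide what to do by looking at attributes of the parsed Namespace."""
    actions = [f"read {args.logfile}"]
    if args.ignores is not None:
        actions.append(f"skip domains listed in {args.ignores}")
    if args.verbose:
        actions.append("print progress")
    return actions


# An explicit list stands in for the real command line in this sketch.
example_args = build_parser().parse_args(["log.gz", "--verbose"])
for action in run(example_args):
    print(action)
```

In a real script, calling `parse_args()` with no argument reads the options from `sys.argv` instead of the explicit list used here.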
14:09
|
Uzerus |
it's my 1st application, i have never seen any module in my life, built something like... wait, i'll pastebin it |
14:10
|
Uzerus |
https://pastebin.com/vV8DF9Cn |
14:10
|
Uzerus |
so i see the namespace(,,,,,,,,,) |
14:12
|
Uzerus |
it should take a file (log.gz), read it (import gzip), and check every domain it finds: if the domain is in the ignores, trash it; if it is not in the ignores and not in the "done_scan" file, output it as a line |
14:13
|
Uzerus |
1 domain 1 line... |
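The filtering step Uzerus describes could be sketched roughly as follows. This assumes the ignores and the "done_scan" file are plain text with one domain per line, and that domains compare case-insensitively; both are assumptions, and the function names are hypothetical.

```python
def load_set(path):
    """Read a one-entry-per-line text file into a set (assumed file format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_domains(domains, ignores, done):
    """Yield each domain once, skipping ignored and already-scanned ones."""
    seen = set()
    for domain in domains:
        domain = domain.strip().lower()
        if not domain or domain in ignores or domain in done or domain in seen:
            continue
        seen.add(domain)
        yield domain


# Made-up data standing in for the domains found in a log.
found = ["example.com", "ads.example", "example.com", "old.example", "new.example"]
for domain in filter_domains(found, ignores={"ads.example"}, done={"old.example"}):
    print(domain)  # 1 domain, 1 line
```

Using sets for the two lookups keeps each membership check cheap even when the ignore list grows large.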
14:14
|
Uzerus |
so many things i must learn, hope i'll manage it sooner or later, but maybe you know a python script that i can use as a template? |
14:14
|
ivan |
Uzerus: https://gist.github.com/ivan/9c49d29f231e5a119ee64d5feb54be10 |
14:15
|
ivan |
(an example of processing arguments, not an implementation of what you want) |
14:16
|
ivan |
can you link me to an example log file? |
14:16
|
Uzerus |
it's every ArchiveBot meta.gz file |
14:17
|
Uzerus |
https://archive.org/download/archiveteam_archivebot_go_20180104070001/urls-pastebin.com-DiLA7Av8-inf-20180102-215402-bxeom-meta.warc.gz |
14:17
|
Uzerus |
i can do it by zgrep -Po '((?<=http://)|(?<=https://))[^/]+(?=/)' *-meta.warc.gz | tail -n+2 | awk '!seen[$0]++' |
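A rough Python counterpart to the shell pipeline above, as a sketch only: it reuses the same regular expression, reads a single gzip-compressed file line by line, and keeps the first occurrence of each host, which is what the `awk '!seen[$0]++'` step does. It does not reproduce the `tail -n+2` step, and the function name `extract_hosts` is hypothetical.

```python
import gzip
import re

# Same pattern as the zgrep call: the host part after http:// or https://,
# up to (but not including) the next slash.
HOST_RE = re.compile(r"(?:(?<=http://)|(?<=https://))[^/]+(?=/)")


def extract_hosts(path):
    """Return unique hosts in first-seen order from a gzip-compressed text file."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            for match in HOST_RE.finditer(line.rstrip("\n")):
                seen.setdefault(match.group(0), None)
    return list(seen)
```

A call such as `print("\n".join(extract_hosts("some-meta.warc.gz")))` would then print one host per line; the file name here is a placeholder.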
14:18
|
Uzerus |
thanks to JAA, but i'll learn a little (and create something more portable) |
14:31
|
Uzerus |
ok, so i have the next problem: how exactly do classes work |
14:32
|
Uzerus |
i'm trying not to look at someone else's code and to write from scratch, to learn things |
14:33
|
Uzerus |
i'll ask when i get stuck :) |
14:46
|
Uzerus |
ivan: will what i wrote take the specified file? i mean, can i use ARGS.logfile? |
14:47
|
|
odemg has quit IRC (Ping timeout: 246 seconds) |
15:03
|
|
odemg has joined #archiveteam-bs |
15:20
|
|
odemg has quit IRC (Ping timeout: 480 seconds) |
15:20
|
|
Jusque has quit IRC (Ping timeout: 250 seconds) |
15:21
|
|
Jusque has joined #archiveteam-bs |
15:30
|
|
odemg has joined #archiveteam-bs |
16:08
|
|
username1 has quit IRC (Quit: Leaving) |
16:17
|
|
Dimtree has quit IRC (Peace) |
17:02
|
|
icedice has joined #archiveteam-bs |
17:32
|
|
icedice has quit IRC (Ping timeout: 245 seconds) |
17:38
|
|
dashcloud has joined #archiveteam-bs |
17:43
|
|
Mateon1 has quit IRC (Ping timeout: 260 seconds) |
17:44
|
|
Mateon1 has joined #archiveteam-bs |
18:40
|
|
atrocity has quit IRC (Ping timeout: 246 seconds) |
20:14
|
|
WubTheCap has joined #archiveteam-bs |
20:19
|
WubTheCap |
Crawling phpBB3 as a logged-in user sucks (open-access boards with login required). Ended up with this script and regex filter list: https://paste.debian.net/plainh/0095af75 |
20:19
|
WubTheCap |
I ended up wasting 3.5 hours of crawling time to find out some of the regex filters because they were not listed on ArchiveTeam wiki yet |
20:19
|
WubTheCap |
That 3.5 hours turned out to be 50% of the pages |
20:20
|
WubTheCap |
Overall the first crawl took 9.5 hours, waiting for this second crawl to complete |
20:20
|
WubTheCap |
Sorry, this was probably meant to be #archiveteam-ot |
20:27
|
WubTheCap |
Even so, a few pages have &sid= in the URI, and the crawler also finds the identical page without &sid= |
20:52
|
JAA |
Yeah, session IDs are really annoying. |
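To spot the duplicates WubTheCap mentions in a list of crawled URLs, one option is to normalize each URL by dropping the session parameter before comparing. This is a sketch using Python's standard `urllib.parse`; the function name `strip_session_id` is hypothetical, and re-encoding the query string may change the escaping of other parameters, so it suits comparing URL lists more than rewriting a crawl queue.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url, param="sid"):
    """Return url with every occurrence of the session parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical forum URLs: both normalize to the same string.
with_sid = "http://forum.example/viewtopic.php?f=2&t=15&sid=abc123"
without_sid = "http://forum.example/viewtopic.php?f=2&t=15"
print(strip_session_id(with_sid) == strip_session_id(without_sid))
```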
20:53
|
JAA |
And check out the ArchiveBot ignore sets. At least some of those things are included in the 'forums' igset. |
20:54
|
JAA |
(That includes all sorts of forum software though, so it's a bit messy.) |
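For readers unfamiliar with ignore sets, the idea is a list of regular expressions, and any URL matching one of them is skipped. The sketch below illustrates that idea only: the three patterns are hypothetical examples for a phpBB-style board and are not copied from the ArchiveBot 'forums' igset or from WubTheCap's paste.

```python
import re

# Hypothetical patterns for illustration only; not the actual 'forums' igset.
IGNORE_PATTERNS = [
    r"[?&]sid=[0-9a-f]+",  # session IDs appended to otherwise identical pages
    r"/posting\.php\?",    # reply/quote forms
    r"/report\.php\?",     # report-a-post forms
]
IGNORES = [re.compile(pattern) for pattern in IGNORE_PATTERNS]


def should_skip(url):
    """Return True if any ignore pattern matches somewhere in the URL."""
    return any(regex.search(url) for regex in IGNORES)


for url in (
    "http://forum.example/viewtopic.php?t=1",
    "http://forum.example/viewtopic.php?t=1&sid=0123abcd",
):
    print(url, "skip" if should_skip(url) else "keep")
```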
20:54
|
JAA |
And this is the right channel for that stuff. You *are* archiving the forum after all. |
21:00
|
WubTheCap |
Oddly enough, my crawler never hit report.php, even though it was not in the list |
21:00
|
WubTheCap |
And yeah, thanks |
21:44
|
godane |
so i got 10 more vhs tapes from savers |
21:45
|
godane |
i got a series of tapes called itty bitty kiddy wildlife |
22:35
|
|
Ravenloft has joined #archiveteam-bs |
23:02
|
godane |
so i'm at about 19k items this month |
23:35
|
|
TC01 has quit IRC (Read error: Operation timed out) |
23:35
|
|
BlueMaxim has joined #archiveteam-bs |
23:39
|
|
TC01 has joined #archiveteam-bs |