| Time |
Nickname |
Message |
|
00:27
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
00:32
🔗
|
|
bsmith093 has joined #archiveteam-bs |
|
01:44
🔗
|
|
pizzaiolo has quit IRC (Read error: Operation timed out) |
|
01:44
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
01:56
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
02:01
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
02:03
🔗
|
|
RichardG_ is now known as RichardG |
|
02:08
🔗
|
|
BubuAnabe has joined #archiveteam-bs |
|
02:11
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
|
02:17
🔗
|
|
pizzaiolo has quit IRC (Ping timeout: 268 seconds) |
|
02:17
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
02:55
🔗
|
|
tsp_ has quit IRC (Read error: Connection reset by peer) |
|
02:59
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
03:00
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
03:22
🔗
|
|
qw3rty has joined #archiveteam-bs |
|
03:22
🔗
|
|
ZexaronS has quit IRC (Leaving) |
|
03:25
🔗
|
|
qw3rty2 has quit IRC (Read error: Operation timed out) |
|
03:38
🔗
|
|
ZexaronS has joined #archiveteam-bs |
|
04:06
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
|
04:12
🔗
|
|
Sk1d has joined #archiveteam-bs |
|
04:26
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
|
04:30
🔗
|
|
godane has joined #archiveteam-bs |
|
05:09
🔗
|
|
j08nY has joined #archiveteam-bs |
|
05:34
🔗
|
|
Fletcher_ has joined #archiveteam-bs |
|
05:36
🔗
|
|
j08nY has quit IRC (Read error: Operation timed out) |
|
06:21
🔗
|
|
Yoshimura has quit IRC (Ping timeout: 260 seconds) |
|
06:22
🔗
|
|
Yoshimura has joined #archiveteam-bs |
|
06:27
🔗
|
godane |
code i could use to grab libgen.pw pdfs : curl -s https://libgen.pw/download.php?id=184042 | grep 'img src=' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%2F|/|g' | sed 's|%3A|:|g' | sed 's|%26|\&|g' | sed 's|.*chl=||g' | sed 's|pdf.*|pdf|g' |
|
06:36
🔗
|
godane |
better code: curl -s https://libgen.pw/download.php?id=184042 | grep 'img src=' | sed 's|.*chl=||g' | sed 's|.*libgen.pw|http://libgen.pw|g' | sed 's|pdf.*|pdf|g' | sed 's|%2F|/|g' | sed 's| |_|g' | sed 's|%26|&|g' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%26|\&|g' |
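A hedged alternative to the sed chains above: a minimal Python sketch that pulls the `chl` query parameter out of each `<img src=...>` and percent-decodes it with the standard library. It assumes, as the pipelines above imply, that the download page embeds the PDF URL percent-encoded in a `chl=` parameter of an image URL; the real page layout is not confirmed here. The function name `extract_chl_urls` and the sample hosts and paths are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, quote, urlsplit


class _ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_chl_urls(html):
    """Return the decoded `chl` value of each <img src=...>, safe to fetch.

    Assumption: the target URL is carried percent-encoded in a `chl=`
    query parameter, as the sed pipelines in the log suggest.
    """
    collector = _ImgSrcCollector()
    collector.feed(html)
    urls = []
    for src in collector.sources:
        query = parse_qs(urlsplit(src).query)
        for target in query.get("chl", []):
            # parse_qs already percent-decoded the value. quote() re-encodes
            # spaces as %20 instead of swapping them for underscores, and
            # leaves the listed URL delimiters untouched.
            urls.append(quote(target, safe=":/?&="))
    return urls


if __name__ == "__main__":
    # Hypothetical sample markup, not copied from the real site.
    sample = (
        '<img src="https://chart.example/chart?cht=qr&amp;'
        'chl=http%3A%2F%2Flibgen.pw%2Fdl%2FSome%20Book.pdf">'
    )
    print(extract_chl_urls(sample))
```

Decoding every escape in one step avoids listing `%3A`, `%2F`, and so on by hand, and keeping spaces as `%20` leaves the result usable as a URL.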
|
07:39
🔗
|
|
brayden has quit IRC (Read error: Connection reset by peer) |
|
07:39
🔗
|
|
brayden has joined #archiveteam-bs |
|
07:39
🔗
|
|
swebb sets mode: +o brayden |
|
07:50
🔗
|
|
Jonison has joined #archiveteam-bs |
|
08:16
🔗
|
|
j08nY has joined #archiveteam-bs |
|
08:46
🔗
|
|
icedice has joined #archiveteam-bs |
|
09:53
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
|
10:30
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
|
12:02
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
|
12:28
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
|
12:35
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
|
12:40
🔗
|
|
Mateon1 has joined #archiveteam-bs |
|
12:42
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
|
12:44
🔗
|
|
j08nY has quit IRC (Read error: Operation timed out) |
|
12:44
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
|
12:46
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
13:25
🔗
|
|
eccfill has quit IRC (Read error: Connection reset by peer) |
|
13:26
🔗
|
|
eccfill has joined #archiveteam-bs |
|
13:31
🔗
|
|
Yurume has quit IRC (Read error: Operation timed out) |
|
13:35
🔗
|
|
Yurume has joined #archiveteam-bs |
|
13:43
🔗
|
|
bitBaron has joined #archiveteam-bs |
|
13:55
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
|
14:13
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
|
14:14
🔗
|
|
j08nY has joined #archiveteam-bs |
|
14:40
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
|
15:33
🔗
|
|
ZexaronS has quit IRC (Quit: Leaving) |
|
15:37
🔗
|
|
Jonison has quit IRC (Quit: Leaving) |
|
15:56
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
16:00
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
16:03
🔗
|
odemg |
godane, the first line output a URL with spaces, the second didn't get anything |
|
16:32
🔗
|
|
bitBaron has joined #archiveteam-bs |
|
16:32
🔗
|
|
bitBaron has quit IRC (Client Quit) |
|
16:34
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
|
16:34
🔗
|
|
bitBaron has joined #archiveteam-bs |
|
16:43
🔗
|
|
Famicoman has quit IRC (Ping timeout: 260 seconds) |
|
16:53
🔗
|
eccfill |
godane: look into djvu for magazine compression |
|
16:57
🔗
|
|
BubuAnabe has quit IRC (Ping timeout: 268 seconds) |
|
16:58
🔗
|
|
BubuAnabe has joined #archiveteam-bs |
|
17:16
🔗
|
|
Odd0002 has quit IRC (Read error: Operation timed out) |
|
17:27
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
|
17:28
🔗
|
|
bitBaron has joined #archiveteam-bs |
|
17:30
🔗
|
|
bitBaron has quit IRC (Client Quit) |
|
17:32
🔗
|
|
bitBaron has joined #archiveteam-bs |
|
17:32
🔗
|
|
bitBaron has quit IRC (Read error: Connection reset by peer) |
|
17:33
🔗
|
|
bitBaron has joined #archiveteam-bs |
|
17:42
🔗
|
Sanqui |
CLICK for Photos 📷 |
|
17:48
🔗
|
|
Jonison has joined #archiveteam-bs |
|
18:19
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
18:19
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
18:33
🔗
|
|
icedice has joined #archiveteam-bs |
|
18:46
🔗
|
|
bitBaron has quit IRC (Read error: Connection reset by peer) |
|
18:46
🔗
|
|
bitBaron_ has joined #archiveteam-bs |
|
18:46
🔗
|
tuluu |
Hi, is 'grab-site' the appropriate way to back up a webpage and create a WARC file? |
|
18:55
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
18:59
🔗
|
|
dashcloud has joined #archiveteam-bs |
|
19:03
🔗
|
Meroje |
for one page webrecorder is probably easier to use |
|
19:16
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
|
19:34
🔗
|
ivan |
tuluu: I use it but then I wrote it so I could be biased |
|
19:35
🔗
|
ivan |
(is there a better self-serve crawling setup you can do? heritrix3? httrack with a proxy?) |
|
19:38
🔗
|
JAA |
Chromium/Firefox headless with warcprox, in theory. Can't wait to try that out when Firefox headless is released in September. |
|
19:38
🔗
|
ivan |
why not headless Chrome now? |
|
19:39
🔗
|
ivan |
I mean, that sounds cool but it's not a "crawl this entire site and linked pages" solution |
|
19:39
🔗
|
ivan |
maybe there's something on github that does that though |
|
19:39
🔗
|
JAA |
Because exactly that. I want to script it to follow links etc. And based on a quick glance at the documentation, Chromium's headless mode isn't exactly powerful. |
|
19:40
🔗
|
JAA |
It really looks like a programmatic interface to the rendering engine, basically. |
|
19:40
🔗
|
JAA |
But not a full browser which supports extensions (for the link following part etc.). |
|
19:41
🔗
|
ivan |
what's missing? you can do pretty much anything you'd be able to do from the devtools |
|
19:41
🔗
|
JAA |
You can? |
|
19:41
🔗
|
ivan |
yes |
|
19:41
🔗
|
JAA |
As I said, I just quickly skimmed the documentation. |
|
19:41
🔗
|
JAA |
If you can, sweet. |
|
19:43
🔗
|
|
TheLovina has quit IRC (Read error: Connection reset by peer) |
|
19:48
🔗
|
|
TheLovina has joined #archiveteam-bs |
|
19:59
🔗
|
tuluu |
ivan: thanks |
|
19:59
🔗
|
eccfill |
JAA: yeah, there's a repl mode |
|
20:00
🔗
|
eccfill |
and chrome headless speaks devtools protocol, so the same things that remote chrome debuggers use are available |
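To illustrate the point about the devtools protocol: commands are JSON messages with an `id`, a `method`, and `params`, sent over a WebSocket to a browser started with `--headless --remote-debugging-port=...`. The sketch below only builds the messages for one hypothetical crawl step (load a page, then list its links); it does not open a connection. `Page.navigate` and `Runtime.evaluate` are DevTools protocol methods, while the helper names `cdp_command` and `crawl_step_commands` are made up for this example.

```python
import itertools
import json

# Each command needs a unique id so replies can be matched to requests.
_next_id = itertools.count(1)


def cdp_command(method, params=None):
    """Serialize one DevTools protocol command as a JSON string."""
    return json.dumps(
        {"id": next(_next_id), "method": method, "params": params or {}}
    )


def crawl_step_commands(url):
    """Build the messages for one crawl step: load `url`, then list its links."""
    return [
        cdp_command("Page.navigate", {"url": url}),
        cdp_command(
            "Runtime.evaluate",
            {
                # JavaScript evaluated in the page; returns every link target.
                "expression": "Array.from(document.links).map(a => a.href)",
                "returnByValue": True,
            },
        ),
    ]


if __name__ == "__main__":
    for message in crawl_step_commands("http://example.com/"):
        print(message)
```

A real crawler would also wait for the page load event before evaluating the script, and would route the traffic through a recording proxy such as warcprox to get WARC output.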
|
20:05
🔗
|
tuluu |
grab-site seems to work well. I will use that for now. |
|
20:29
🔗
|
JAA |
I guess I'll play around with headless Chromium in the next weeks then. |
|
20:29
🔗
|
|
Stilett0 has joined #archiveteam-bs |
|
20:33
🔗
|
JAA |
I'll still wait for headless Firefox before trying to build anything real though. I'm curious to see how they stack up against each other, in particular regarding resource usage. |
|
21:01
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
|
21:21
🔗
|
|
logchfoo3 starts logging #archiveteam-bs at Mon Jul 10 21:21:32 2017 |
|
21:21
🔗
|
|
logchfoo3 has joined #archiveteam-bs |
|
21:22
🔗
|
|
powerArch has quit IRC (Remote host closed the connection) |
|
21:29
🔗
|
|
Mateon1 has quit IRC (Quit: Mateon1) |
|
21:31
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
|
21:33
🔗
|
|
powerArch has joined #archiveteam-bs |
|
21:34
🔗
|
|
bwn_ is now known as bwn |
|
21:34
🔗
|
|
LeG0ax is now known as Ing3b0rg |
|
21:44
🔗
|
|
Famicoman has joined #archiveteam-bs |
|
21:57
🔗
|
|
Jonison has quit IRC (Read error: Connection reset by peer) |
|
22:11
🔗
|
|
purplebot has joined #archiveteam-bs |
|
22:12
🔗
|
|
voidsta has joined #archiveteam-bs |
|
22:13
🔗
|
|
i0npulse has joined #archiveteam-bs |
|
22:37
🔗
|
|
Hecatz has joined #archiveteam-bs |
|
22:41
🔗
|
JAA |
Argh, I hate wix.com sites. They're very annoying to archive as well. |
|
22:55
🔗
|
|
kimmer has joined #archiveteam-bs |
|
22:56
🔗
|
kimmer |
so, have rate limits been implemented on the URL shortener services? |
|
22:57
🔗
|
xmc |
most shorteners will tell scrapers to get stuffed if they ask for too many urls |
|
22:57
🔗
|
xmc |
that's why more ip addresses is better |
|
22:58
🔗
|
kimmer |
all right.. guess it was putting a bit of a load on their servers if we just hammered them. |
|
22:58
🔗
|
|
Odd0002 has joined #archiveteam-bs |
|
23:07
🔗
|
dashcloud |
Wanted to make sure people were aware of https://twitter.com/LinkArchiver - if you follow the account, anytime you tweet a link, it gets archived on IA, and if you @ the account, it'll tweet back the archive link |
|
23:07
🔗
|
|
agrecasci has joined #archiveteam-bs |
|
23:11
🔗
|
JAA |
Nice. Would be even better if it also archived the tweet. |
|
23:11
🔗
|
JAA |
But I guess that's a separate project. |
|
23:11
🔗
|
arkiver |
yeah, awesome project |
|
23:12
🔗
|
arkiver |
also a great way, I think, to get more people to know the Wayback Machine |
|
23:20
🔗
|
arkiver |
joepie91: finally fixed :) https://github.com/jjjake/internetarchive/issues/63 |
|
23:21
🔗
|
arkiver |
(it should be at least) |
|
23:35
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
|
23:47
🔗
|
DFJustin |
oh looks like they fixed the bookreader for special character filenames at some point too |
|
23:47
🔗
|
mundus201 |
What is the process for extracting all of the files from a .megawarc.warc.gz? |
|
23:51
🔗
|
arkiver |
https://github.com/chfoo/warcat |
|
23:52
🔗
|
arkiver |
mundus201 ^ |
|
23:54
🔗
|
mundus201 |
Gracias arkiver |
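For context on what extraction involves: a `.warc.gz` is a gzip stream of WARC records, each made of a version line, named headers, a blank line, and a block of `Content-Length` bytes. The following is a minimal standard-library sketch for walking those records. It is not warcat's API and is no substitute for the tool linked above; the function names are hypothetical, and it assumes well-formed records.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, block_bytes) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the exact size of the record block.
        yield headers, stream.read(int(headers["Content-Length"]))


def iter_warc_gz(path_or_fileobj):
    """Open a (possibly multi-member) gzip WARC and yield its records."""
    with gzip.open(path_or_fileobj, "rb") as stream:
        yield from iter_warc_records(stream)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some.megawarc.warc.gz
    for headers, block in iter_warc_gz(sys.argv[1]):
        print(headers.get("WARC-Type"), headers.get("WARC-Target-URI"), len(block))
```

For `response` records the block is the raw HTTP response, so saving the file itself still means stripping the HTTP headers from the block.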