Time |
Nickname |
Message |
00:27
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
00:32
🔗
|
|
bsmith093 has joined #archiveteam-bs |
01:44
🔗
|
|
pizzaiolo has quit IRC (Read error: Operation timed out) |
01:44
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
01:56
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
02:01
🔗
|
|
dashcloud has joined #archiveteam-bs |
02:03
🔗
|
|
RichardG_ is now known as RichardG |
02:08
🔗
|
|
BubuAnabe has joined #archiveteam-bs |
02:11
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
02:17
🔗
|
|
pizzaiolo has quit IRC (Ping timeout: 268 seconds) |
02:17
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
02:55
🔗
|
|
tsp_ has quit IRC (Read error: Connection reset by peer) |
02:59
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
03:00
🔗
|
|
dashcloud has joined #archiveteam-bs |
03:22
🔗
|
|
qw3rty has joined #archiveteam-bs |
03:22
🔗
|
|
ZexaronS has quit IRC (Leaving) |
03:25
🔗
|
|
qw3rty2 has quit IRC (Read error: Operation timed out) |
03:38
🔗
|
|
ZexaronS has joined #archiveteam-bs |
04:06
🔗
|
|
Sk1d has quit IRC (Ping timeout: 194 seconds) |
04:12
🔗
|
|
Sk1d has joined #archiveteam-bs |
04:26
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
04:30
🔗
|
|
godane has joined #archiveteam-bs |
05:09
🔗
|
|
j08nY has joined #archiveteam-bs |
05:34
🔗
|
|
Fletcher_ has joined #archiveteam-bs |
05:36
🔗
|
|
j08nY has quit IRC (Read error: Operation timed out) |
06:21
🔗
|
|
Yoshimura has quit IRC (Ping timeout: 260 seconds) |
06:22
🔗
|
|
Yoshimura has joined #archiveteam-bs |
06:27
🔗
|
godane |
code i could use to grab libgen.pw pdfs : curl -s 'https://libgen.pw/download.php?id=184042' | grep 'img src=' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%2F|/|g' | sed 's|%3A|:|g' | sed 's|%26|\&|g' | sed 's|.*chl=||g' | sed 's|pdf.*|pdf|g' |
06:36
🔗
|
godane |
better code: curl -s 'https://libgen.pw/download.php?id=184042' | grep 'img src=' | sed 's|.*chl=||g' | sed 's|.*libgen.pw|http://libgen.pw|g' | sed 's|pdf.*|pdf|g' | sed 's|%2F|/|g' | sed 's| |_|g' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%26|\&|g' |
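The sed chains above percent-decode a URL one escape at a time (`%2F`, `%3D`, `%3F`, `%26`, and `%3A` in the first version). A minimal Python sketch of the same decoding step, assuming the encoded string has already been pulled out of the page; the helper name `decode_link` and the sample value are hypothetical and not real libgen.pw output:

```python
from urllib.parse import unquote


def decode_link(encoded: str) -> str:
    """Percent-decode a URL fragment, then replace spaces with underscores."""
    # unquote() handles every %XX escape, not only the handful in the sed chain.
    decoded = unquote(encoded)
    # Mirrors the `sed 's| |_|g'` step; whether the server accepts the
    # underscored name is an assumption carried over from the pipeline.
    return decoded.replace(" ", "_")


# Hypothetical sample input for illustration only.
sample = "http%3A%2F%2Fexample.org%2Fget%3Fid%3D1%26name%3Dsome%20file.pdf"
print(decode_link(sample))
```

Decoding everything in one call avoids depending on the order of the substitutions, and it also covers escapes such as `%20` that the sed chain does not list.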
07:39
🔗
|
|
brayden has quit IRC (Read error: Connection reset by peer) |
07:39
🔗
|
|
brayden has joined #archiveteam-bs |
07:39
🔗
|
|
swebb sets mode: +o brayden |
07:50
🔗
|
|
Jonison has joined #archiveteam-bs |
08:16
🔗
|
|
j08nY has joined #archiveteam-bs |
08:46
🔗
|
|
icedice has joined #archiveteam-bs |
09:53
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
10:30
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
12:02
🔗
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
12:28
🔗
|
|
BlueMaxim has quit IRC (Read error: Operation timed out) |
12:35
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
12:40
🔗
|
|
Mateon1 has joined #archiveteam-bs |
12:42
🔗
|
|
pizzaiolo has joined #archiveteam-bs |
12:44
🔗
|
|
j08nY has quit IRC (Read error: Operation timed out) |
12:44
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
12:46
🔗
|
|
dashcloud has joined #archiveteam-bs |
13:25
🔗
|
|
eccfill has quit IRC (Read error: Connection reset by peer) |
13:26
🔗
|
|
eccfill has joined #archiveteam-bs |
13:31
🔗
|
|
Yurume has quit IRC (Read error: Operation timed out) |
13:35
🔗
|
|
Yurume has joined #archiveteam-bs |
13:43
🔗
|
|
bitBaron has joined #archiveteam-bs |
13:55
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
14:13
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
14:14
🔗
|
|
j08nY has joined #archiveteam-bs |
14:40
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
15:33
🔗
|
|
ZexaronS has quit IRC (Quit: Leaving) |
15:37
🔗
|
|
Jonison has quit IRC (Quit: Leaving) |
15:56
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
16:00
🔗
|
|
dashcloud has joined #archiveteam-bs |
16:03
🔗
|
odemg |
godane, first line outputs a url with spaces, second didn't get anything |
16:32
🔗
|
|
bitBaron has joined #archiveteam-bs |
16:32
🔗
|
|
bitBaron has quit IRC (Client Quit) |
16:34
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
16:34
🔗
|
|
bitBaron has joined #archiveteam-bs |
16:43
🔗
|
|
Famicoman has quit IRC (Ping timeout: 260 seconds) |
16:53
🔗
|
eccfill |
godane: look into djvu for magazine compression |
16:57
🔗
|
|
BubuAnabe has quit IRC (Ping timeout: 268 seconds) |
16:58
🔗
|
|
BubuAnabe has joined #archiveteam-bs |
17:16
🔗
|
|
Odd0002 has quit IRC (Read error: Operation timed out) |
17:27
🔗
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
17:28
🔗
|
|
bitBaron has joined #archiveteam-bs |
17:30
🔗
|
|
bitBaron has quit IRC (Client Quit) |
17:32
🔗
|
|
bitBaron has joined #archiveteam-bs |
17:32
🔗
|
|
bitBaron has quit IRC (Read error: Connection reset by peer) |
17:33
🔗
|
|
bitBaron has joined #archiveteam-bs |
17:42
🔗
|
Sanqui |
CLICK for Photos 📷 |
17:48
🔗
|
|
Jonison has joined #archiveteam-bs |
18:19
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:19
🔗
|
|
dashcloud has joined #archiveteam-bs |
18:33
🔗
|
|
icedice has joined #archiveteam-bs |
18:46
🔗
|
|
bitBaron has quit IRC (Read error: Connection reset by peer) |
18:46
🔗
|
|
bitBaron_ has joined #archiveteam-bs |
18:46
🔗
|
tuluu |
Hi, is 'grab-site' the appropriate way to back up a webpage and create a WARC file? |
18:55
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
18:59
🔗
|
|
dashcloud has joined #archiveteam-bs |
19:03
🔗
|
Meroje |
for one page webrecorder is probably easier to use |
19:16
🔗
|
|
SHODAN_UI has joined #archiveteam-bs |
19:34
🔗
|
ivan |
tuluu: I use it but then I wrote it so I could be biased |
19:35
🔗
|
ivan |
(is there a better self-serve crawling setup you can do? heritrix3? httrack with a proxy?) |
19:38
🔗
|
JAA |
Chromium/Firefox headless with warcprox, in theory. Can't wait to try that out when Firefox headless is released in September. |
19:38
🔗
|
ivan |
why not headless Chrome now? |
19:39
🔗
|
ivan |
I mean, that sounds cool but it's not a "crawl this entire site and linked pages" solution |
19:39
🔗
|
ivan |
maybe there's something on github that does that though |
19:39
🔗
|
JAA |
Because exactly that. I want to script it to follow links etc. And based on a quick glance at the documentation, Chromium's headless mode isn't exactly powerful. |
19:40
🔗
|
JAA |
It really looks like a programmatic interface to the rendering engine, basically. |
19:40
🔗
|
JAA |
But not a full browser which supports extensions (for the link following part etc.). |
19:41
🔗
|
ivan |
what's missing? you can do pretty much anything you'd be able to do from the devtools |
19:41
🔗
|
JAA |
You can? |
19:41
🔗
|
ivan |
yes |
19:41
🔗
|
JAA |
As I said, I just quickly skimmed the documentation. |
19:41
🔗
|
JAA |
If you can, sweet. |
19:43
🔗
|
|
TheLovina has quit IRC (Read error: Connection reset by peer) |
19:48
🔗
|
|
TheLovina has joined #archiveteam-bs |
19:59
🔗
|
tuluu |
ivan: thanks |
19:59
🔗
|
eccfill |
JAA: yeah, there's a repl mode |
20:00
🔗
|
eccfill |
and chrome headless speaks devtools protocol, so the same things that remote chrome debuggers use are available |
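As a rough illustration of the point about the devtools protocol: commands are JSON objects with an `id`, a `method`, and `params`. The sketch below only builds two such messages (`Page.navigate` and `Runtime.evaluate`) and does not send them; the transport to the browser's debugging endpoint is left out, and the helper name `build_command` and the example URL are assumptions made for illustration:

```python
import itertools
import json

# Message ids only need to be unique per connection; a counter is enough here.
_next_id = itertools.count(1)


def build_command(method, **params):
    """Return a DevTools protocol command as a plain dict (hypothetical helper)."""
    return {"id": next(_next_id), "method": method, "params": params}


# Load a page, then ask the page for the href of every link it contains.
navigate = build_command("Page.navigate", url="https://example.org/")
collect_links = build_command(
    "Runtime.evaluate",
    expression="Array.from(document.links).map(a => a.href)",
    returnByValue=True,
)

for command in (navigate, collect_links):
    print(json.dumps(command))
```

Following links would then be a matter of feeding the returned hrefs back into further `Page.navigate` commands, ideally with the browser pointed at a recording proxy such as warcprox, as mentioned earlier in the discussion.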
20:05
🔗
|
tuluu |
grab-site seems to work well. I will use that for now. |
20:29
🔗
|
JAA |
I guess I'll play around with headless Chromium in the next weeks then. |
20:29
🔗
|
|
Stilett0 has joined #archiveteam-bs |
20:33
🔗
|
JAA |
I'll still wait for headless Firefox before trying to build anything real though. I'm curious to see how they stack up against each other, in particular regarding resource usage. |
21:01
🔗
|
|
icedice has quit IRC (Quit: Leaving) |
21:21
🔗
|
|
logchfoo3 starts logging #archiveteam-bs at Mon Jul 10 21:21:32 2017 |
21:21
🔗
|
|
logchfoo3 has joined #archiveteam-bs |
21:22
🔗
|
|
powerArch has quit IRC (Remote host closed the connection) |
21:29
🔗
|
|
Mateon1 has quit IRC (Quit: Mateon1) |
21:31
🔗
|
|
SHODAN_UI has quit IRC (Remote host closed the connection) |
21:33
🔗
|
|
powerArch has joined #archiveteam-bs |
21:34
🔗
|
|
bwn_ is now known as bwn |
21:34
🔗
|
|
LeG0ax is now known as Ing3b0rg |
21:44
🔗
|
|
Famicoman has joined #archiveteam-bs |
21:57
🔗
|
|
Jonison has quit IRC (Read error: Connection reset by peer) |
22:11
🔗
|
|
purplebot has joined #archiveteam-bs |
22:12
🔗
|
|
voidsta has joined #archiveteam-bs |
22:13
🔗
|
|
i0npulse has joined #archiveteam-bs |
22:37
🔗
|
|
Hecatz has joined #archiveteam-bs |
22:41
🔗
|
JAA |
Argh, I hate wix.com sites. They're very annoying to archive as well. |
22:55
🔗
|
|
kimmer has joined #archiveteam-bs |
22:56
🔗
|
kimmer |
soo, have rate limits been implemented on the URL shortener services? |
22:57
🔗
|
xmc |
most shorteners will tell scrapers to get stuffed if they ask for too many urls |
22:57
🔗
|
xmc |
that's why more ip addresses is better |
22:58
🔗
|
kimmer |
alright.. guess it was putting a bit of a load on their servers if we just hammered them. |
22:58
🔗
|
|
Odd0002 has joined #archiveteam-bs |
23:07
🔗
|
dashcloud |
Wanted to make sure people were aware of https://twitter.com/LinkArchiver - if you follow the account, anytime you tweet a link, it gets archived on IA, and if you @ the account, it'll tweet back the archive link |
23:07
🔗
|
|
agrecasci has joined #archiveteam-bs |
23:11
🔗
|
JAA |
Nice. Would be even better if it also archived the tweet. |
23:11
🔗
|
JAA |
But I guess that's a separate project. |
23:11
🔗
|
arkiver |
yeah, awesome project |
23:12
🔗
|
arkiver |
also a great way I think to get more people to know the wayback machine |
23:20
🔗
|
arkiver |
joepie91: finally fixed :) https://github.com/jjjake/internetarchive/issues/63 |
23:21
🔗
|
arkiver |
(it should be at least) |
23:35
🔗
|
|
j08nY has quit IRC (Quit: Leaving) |
23:47
🔗
|
DFJustin |
oh looks like they fixed the bookreader for special character filenames at some point too |
23:47
🔗
|
mundus201 |
What is the process for extracting all of the files from a .megawarc.warc.gz? |
23:51
🔗
|
arkiver |
https://github.com/chfoo/warcat |
23:52
🔗
|
arkiver |
mundus201 ^ |
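A minimal sketch of how the extraction might be driven from Python, assuming warcat is installed and that its command line accepts `extract` with `--output-dir` and `--progress` (worth confirming with `python3 -m warcat --help` before relying on it). The helper names and file paths are hypothetical:

```python
import subprocess
import sys


def warcat_extract_command(warc_path, output_dir):
    """Build the warcat command line; the flags are assumed, check --help."""
    return [
        sys.executable, "-m", "warcat", "extract",
        warc_path,
        "--output-dir", output_dir,
        "--progress",
    ]


def extract(warc_path, output_dir):
    """Run the extraction; raises CalledProcessError if warcat exits non-zero."""
    subprocess.run(warcat_extract_command(warc_path, output_dir), check=True)


# Print the command instead of running it, since warcat may not be installed.
print(" ".join(warcat_extract_command("example.megawarc.warc.gz", "extracted/")))
```

Calling `extract("example.megawarc.warc.gz", "extracted/")` would run the printed command and write the extracted files under the given directory.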
23:54
🔗
|
mundus201 |
Thanks arkiver |