#archiveteam-bs 2017-07-10, Mon

Who What When
*** BlueMaxim has joined #archiveteam-bs [00:27]
bsmith093 has joined #archiveteam-bs [00:32]
............... (idle for 1h12mn)
pizzaiolo has quit IRC (Read error: Operation timed out)
pizzaiolo has joined #archiveteam-bs
[01:44]
dashcloud has quit IRC (Read error: Operation timed out) [01:56]
dashcloud has joined #archiveteam-bs
RichardG_ is now known as RichardG
[02:01]
BubuAnabe has joined #archiveteam-bs
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
[02:08]
pizzaiolo has quit IRC (Ping timeout: 268 seconds)
pizzaiolo has joined #archiveteam-bs
[02:17]
........ (idle for 38mn)
tsp_ has quit IRC (Read error: Connection reset by peer)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[02:55]
..... (idle for 22mn)
qw3rty has joined #archiveteam-bs
ZexaronS has quit IRC (Leaving)
qw3rty2 has quit IRC (Read error: Operation timed out)
[03:22]
ZexaronS has joined #archiveteam-bs [03:38]
...... (idle for 28mn)
Sk1d has quit IRC (Ping timeout: 194 seconds) [04:06]
Sk1d has joined #archiveteam-bs [04:12]
godane has quit IRC (Read error: Operation timed out)
godane has joined #archiveteam-bs
[04:26]
........ (idle for 39mn)
j08nY has joined #archiveteam-bs [05:09]
...... (idle for 25mn)
Fletcher_ has joined #archiveteam-bs
j08nY has quit IRC (Read error: Operation timed out)
[05:34]
.......... (idle for 45mn)
Yoshimura has quit IRC (Ping timeout: 260 seconds)
Yoshimura has joined #archiveteam-bs
[06:21]
<godane> code i could use to grab libgen.pw pdfs : curl -s 'https://libgen.pw/download.php?id=184042' | grep 'img src=' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%2F|/|g' | sed 's|%3A|:|g' | sed 's|%26|\&|g' | sed 's|.*chl=||g' | sed 's|pdf.*|pdf|g' [06:27]
better code: curl -s 'https://libgen.pw/download.php?id=184042' | grep 'img src=' | sed 's|.*chl=||g' | sed 's|.*libgen.pw|http://libgen.pw|g' | sed 's|pdf.*|pdf|g' | sed 's|%2F|/|g' | sed 's| |_|g' | sed 's|%3D|=|g' | sed 's|%3F|?|g' | sed 's|%26|\&|g' [06:36]
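The chain of sed substitutions above is hand-rolled percent-decoding. A minimal Python sketch of that one step follows, assuming (as the pipeline implies) that the page embeds the download link as the percent-encoded `chl=` parameter of an image URL. The function name and the sample URL are made up for illustration, and the real libgen.pw markup was not checked.

```python
from urllib.parse import quote, unquote


def extract_pdf_url(img_src: str) -> str:
    """Decode the link carried in the chl= parameter of an image URL."""
    if "chl=" not in img_src:
        raise ValueError("no chl= parameter found")
    # Take everything after the last 'chl=' (like sed 's|.*chl=||'),
    # then stop at the next raw '&', which starts the following parameter.
    encoded = img_src.rsplit("chl=", 1)[1].split("&", 1)[0]
    # unquote() handles every %XX escape, not just the five listed in the chat.
    decoded = unquote(encoded)
    # Re-encode characters that are not valid in a URL (spaces become %20)
    # while leaving the URL delimiters untouched.
    return quote(decoded, safe=":/?=&")


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the real site.
    sample = ("https://chart.example/chart?cht=qr"
              "&chl=http%3A%2F%2Flibgen.pw%2Fget%3Fid%3D1%26name%3DSome%20Book.pdf"
              "&chs=200")
    print(extract_pdf_url(sample))
```

Re-encoding with `quote` keeps spaces in file names from breaking the URL, which replacing them with underscores would not guarantee.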
............. (idle for 1h3mn)
*** brayden has quit IRC (Read error: Connection reset by peer)
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[07:39]
Jonison has joined #archiveteam-bs [07:50]
...... (idle for 26mn)
j08nY has joined #archiveteam-bs [08:16]
....... (idle for 30mn)
icedice has joined #archiveteam-bs [08:46]
.............. (idle for 1h7mn)
SHODAN_UI has joined #archiveteam-bs [09:53]
........ (idle for 37mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [10:30]
................... (idle for 1h32mn)
pizzaiolo has quit IRC (Remote host closed the connection) [12:02]
...... (idle for 26mn)
BlueMaxim has quit IRC (Read error: Operation timed out) [12:28]
SHODAN_UI has joined #archiveteam-bs [12:35]
Mateon1 has joined #archiveteam-bs
pizzaiolo has joined #archiveteam-bs
j08nY has quit IRC (Read error: Operation timed out)
dashcloud has quit IRC (Remote host closed the connection)
dashcloud has joined #archiveteam-bs
[12:40]
........ (idle for 39mn)
eccfill has quit IRC (Read error: Connection reset by peer)
eccfill has joined #archiveteam-bs
[13:25]
Yurume has quit IRC (Read error: Operation timed out)
Yurume has joined #archiveteam-bs
[13:31]
bitBaron has joined #archiveteam-bs [13:43]
icedice has quit IRC (Quit: Leaving) [13:55]
.... (idle for 18mn)
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
j08nY has joined #archiveteam-bs
[14:13]
...... (idle for 26mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [14:40]
........... (idle for 53mn)
ZexaronS has quit IRC (Quit: Leaving)
Jonison has quit IRC (Quit: Leaving)
[15:33]
.... (idle for 19mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[15:56]
<odemg> godane, first line output url with spaces, second didn't get anything [16:03]
...... (idle for 29mn)
*** bitBaron has joined #archiveteam-bs
bitBaron has quit IRC (Client Quit)
Stiletto has quit IRC (Read error: Operation timed out)
bitBaron has joined #archiveteam-bs
[16:32]
Famicoman has quit IRC (Ping timeout: 260 seconds) [16:43]
<eccfill> godane: look into djvu for magazine compression [16:53]
*** BubuAnabe has quit IRC (Ping timeout: 268 seconds)
BubuAnabe has joined #archiveteam-bs
[16:57]
.... (idle for 18mn)
Odd0002 has quit IRC (Read error: Operation timed out) [17:16]
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…)
bitBaron has joined #archiveteam-bs
bitBaron has quit IRC (Client Quit)
bitBaron has joined #archiveteam-bs
bitBaron has quit IRC (Read error: Connection reset by peer)
bitBaron has joined #archiveteam-bs
[17:27]
<Sanqui> CLICK for Photos 📷 [17:42]
*** Jonison has joined #archiveteam-bs [17:48]
....... (idle for 31mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[18:19]
icedice has joined #archiveteam-bs [18:33]
bitBaron has quit IRC (Read error: Connection reset by peer)
bitBaron_ has joined #archiveteam-bs
[18:46]
<tuluu> Hi, is 'grab-site' the appropriate way to back up a webpage and create a WARC file? [18:46]
*** dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[18:55]
<Meroje> for one page webrecorder is probably easier to use [19:03]
*** SHODAN_UI has joined #archiveteam-bs [19:16]
.... (idle for 18mn)
<ivan> tuluu: I use it but then I wrote it so I could be biased
(is there a better self-serve crawling setup you can do? heritrix3? httrack with a proxy?)
[19:34]
<JAA> Chromium/Firefox headless with warcprox, in theory. Can't wait to try that out when Firefox headless is released in September. [19:38]
<ivan> why not headless Chrome now?
I mean, that sounds cool but it's not a "crawl this entire site and linked pages" solution
maybe there's something on github that does that though
[19:38]
<JAA> Because exactly that. I want to script it to follow links etc. And based on a quick glance at the documentation, Chromium's headless mode isn't exactly powerful.
It really looks like a programmatic interface to the rendering engine, basically.
But not a full browser which supports extensions (for the link following part etc.).
[19:39]
<ivan> what's missing? you can do pretty much anything you'd be able to do from the devtools [19:41]
<JAA> You can? [19:41]
<ivan> yes [19:41]
<JAA> As I said, I just quickly skimmed the documentation.
If you can, sweet.
[19:41]
*** TheLovina has quit IRC (Read error: Connection reset by peer) [19:43]
TheLovina has joined #archiveteam-bs [19:48]
<tuluu> ivan: thanks [19:59]
<eccfill> JAA: yeah, there's a repl mode
and chrome headless speaks devtools protocol, so the same things that remote chrome debuggers use are available
[19:59]
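For readers unfamiliar with the DevTools protocol mentioned above: commands are JSON messages with an `id`, a `method`, and `params`, sent over a WebSocket to a browser started with `--headless --remote-debugging-port=9222`. The sketch below only builds such messages and lists the browser's targets over its HTTP endpoint. It assumes a locally running browser on port 9222 for `list_targets`, and the helper names are made up for illustration. Actually sending the messages would need a WebSocket client, which is left out here.

```python
import itertools
import json
import urllib.request

# Each command carries a unique id so replies can be matched to requests.
_ids = itertools.count(1)


def cdp_command(method: str, **params) -> str:
    """Serialize one DevTools protocol command as a JSON text message."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


def list_targets(port: int = 9222) -> list:
    """Ask a running browser for its debuggable targets (tabs).

    Each entry includes a 'webSocketDebuggerUrl' to connect to.
    Assumes the browser was started with --remote-debugging-port=<port>.
    """
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/json") as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Load a page, then collect every link on it: the building blocks
    # of a scripted "follow links" crawl.
    print(cdp_command("Page.navigate", url="https://example.com/"))
    print(cdp_command(
        "Runtime.evaluate",
        expression="[...document.links].map(a => a.href)",
        returnByValue=True,
    ))
```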
<tuluu> grab-site seems to work well. I will use that for now. [20:05]
..... (idle for 24mn)
<JAA> I guess I'll play around with headless Chromium in the next few weeks then. [20:29]
*** Stilett0 has joined #archiveteam-bs [20:29]
<JAA> I'll still wait for headless Firefox before trying to build anything real though. I'm curious to see how they stack up against each other, in particular regarding resource usage. [20:33]
...... (idle for 28mn)
*** icedice has quit IRC (Quit: Leaving) [21:01]
..... (idle for 20mn)
logchfoo3 starts logging #archiveteam-bs at Mon Jul 10 21:21:32 2017
logchfoo3 has joined #archiveteam-bs
powerArch has quit IRC (Remote host closed the connection)
[21:21]
Mateon1 has quit IRC (Quit: Mateon1)
SHODAN_UI has quit IRC (Remote host closed the connection)
powerArch has joined #archiveteam-bs
bwn_ is now known as bwn
LeG0ax is now known as Ing3b0rg
[21:29]
Famicoman has joined #archiveteam-bs [21:44]
Jonison has quit IRC (Read error: Connection reset by peer) [21:57]
purplebot has joined #archiveteam-bs
voidsta has joined #archiveteam-bs
i0npulse has joined #archiveteam-bs
[22:11]
..... (idle for 24mn)
Hecatz has joined #archiveteam-bs [22:37]
<JAA> Argh, I hate wix.com sites. They're very annoying to archive as well. [22:41]
*** kimmer has joined #archiveteam-bs [22:55]
<kimmer> soo, have rate limits been implemented on the URL shortener services? [22:56]
<xmc> most shorteners will tell scrapers to get stuffed if they ask for too many urls
that's why more ip addresses is better
[22:57]
<kimmer> all right.. guess it was putting a bit of a load on their servers if we just hammered them. [22:58]
*** Odd0002 has joined #archiveteam-bs [22:58]
<dashcloud> Wanted to make sure people were aware of https://twitter.com/LinkArchiver - if you follow the account, anytime you tweet a link, it gets archived on IA, and if you @ the account, it'll tweet back the archive link [23:07]
*** agrecasci has joined #archiveteam-bs [23:07]
<JAA> Nice. Would be even better if it also archived the tweet.
But I guess that's a separate project.
[23:11]
<arkiver> yeah, awesome project
also a great way I think to get more people to know the wayback machine
[23:11]
joepie91: finally fixed :) https://github.com/jjjake/internetarchive/issues/63
(it should be at least)
[23:20]
*** j08nY has quit IRC (Quit: Leaving) [23:35]
<DFJustin> oh looks like they fixed the bookreader for special character filenames at some point too [23:47]
<mundus201> What is the process for extracting all of the files from a .megawarc.warc.gz? [23:47]
<arkiver> https://github.com/chfoo/warcat
mundus201 ^
[23:51]
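warcat is the tool recommended above. To show what such a tool has to do, here is a minimal stdlib-only sketch that walks the records of a gzip-compressed WARC. It is not warcat's API. It assumes the megawarc's `.warc.gz` is an ordinary WARC made of concatenated gzip members, and the function name is made up for illustration.

```python
import gzip
import io


def iter_warc_records(fileobj):
    """Yield (headers, payload) for each record in a gzip-compressed WARC."""
    # gzip.open reads straight through concatenated gzip members,
    # which is how .warc.gz files are laid out.
    with gzip.open(fileobj, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.strip():
                continue  # blank lines separating records
            if not line.startswith(b"WARC/"):
                raise ValueError(f"unexpected line: {line[:40]!r}")
            headers = {}
            # Header lines run until the first empty line.
            for raw in iter(stream.readline, b"\r\n"):
                if not raw:
                    raise ValueError("truncated record header")
                name, _, value = raw.decode("utf-8").partition(":")
                headers[name.strip()] = value.strip()
            # The block that follows is exactly Content-Length bytes long.
            payload = stream.read(int(headers["Content-Length"]))
            yield headers, payload


if __name__ == "__main__":
    # Build a tiny two-record sample in memory instead of reading a real file.
    def make_record(uri: str, body: bytes) -> bytes:
        head = (f"WARC/1.0\r\nWARC-Type: resource\r\n"
                f"WARC-Target-URI: {uri}\r\n"
                f"Content-Length: {len(body)}\r\n\r\n").encode("utf-8")
        return gzip.compress(head + body + b"\r\n\r\n")

    data = make_record("http://example.com/a", b"hello") + \
        make_record("http://example.com/b", b"world")
    for headers, payload in iter_warc_records(io.BytesIO(data)):
        print(headers["WARC-Target-URI"], len(payload))
```

For `response` records the payload still begins with the HTTP status line and headers, so writing the files out means splitting those off first.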
<mundus201> Gracias arkiver [23:54]