Time |
Nickname |
Message |
01:10 |
primus |
I would like to use archivebot to archive a site (http://retrospec.sgn.net/users/tomcat/yu/index.php). The problem is that some links on the site use a pc.sux.org address, which redirects back to content on the original site (the link to the magazines does that, for example). |
01:11 |
primus |
Could someone help me with it since I've never used archivebot before? |
01:12 |
primus |
Dang, I edited the first msg so many times it's even worse than my usual english ;-) |
02:38 |
Cybele |
Hi, I'm trying to get my stories from https://archive.org/details/archiveteam-fanfiction-warc-01. I know they are in there according to the tar file. |
02:39 |
Cybele |
I downloaded the 46GB warc but attempting to browse its contents has proved frustrating so far |
02:39 |
Cybele |
Can the other files be used to extract my stories from the warc? |
02:40 |
RKenshin |
[03:35] * Cybele (Mibbit@host86-138-31-130.range86-138.btcentralplus.com) Quit (http://www.mibbit.com ajax IRC Client) |
02:40 |
RKenshin |
[04:01] <@DFJustin> http://warctozip.archive.org/ |
02:40 |
RKenshin |
[04:01] <@DFJustin> https://github.com/ArchiveTeam/warctozip |
02:40 |
RKenshin |
[04:01] <@DFJustin> warctozip |
02:40 |
willwill |
Try the Wayback Machine? |
02:40 |
Cybele |
Fanfiction.net is blocked thanks to robots.txt |
05:50 |
DFJustin |
warctozip won't work on 46gb files unless you hack it to support zip64 |
05:51 |
DFJustin |
primus: redirects like that should be ok since archivebot follows external links one level deep |
08:34 |
midas |
it might work with a warcproxy? |
09:01 |
danneh_ |
hmm: http://www.infinidb.co/forum/important-announcement-infinidb |
09:27 |
catbuster |
Has anyone tried using PhantomJS to archive websites? I'm only looking at archiving individual URLs |
09:32 |
midas |
archivebot uses it on request |
09:34 |
Rotab |
wouldn't that make them into images? or is that just a side-feature of phantom? :P |
10:29 |
catbuster |
Rotab: You can also use it to create WARC files or simply just download all the resources using a proxy. Archive.today uses it for their crawling. |
10:30 |
catbuster |
midas: I'm looking for something I can use to archive a single URL, rather than an entire website, which is what I believe archivebot does. |
10:31 |
midas |
well, wget can do that |
10:32 |
catbuster |
But wget can't archive dynamic websites with a lot of javascript |
10:33 |
catbuster |
What does archivebot do when using PhantomJS? Using WarcMITMproxy? |
10:33 |
Rotab |
catbuster: ah :) |
10:38 |
midas |
catbuster: ask yipdw |
10:39 |
catbuster |
midas: Alright. |
13:14 |
phuzion |
What project should I be putting my resources towards? Something that I can throw lots of threads at? |
14:00 |
dudly7635 |
Hey guys I was just thinking, with the way things have been going, would it be a good idea to just start archiving all torrent sites so a repeat of isohunt doesn't happen |
14:02 |
dudly7635 |
Just figured I would put that out there, we could put them into the archivebot now and just let it run, of course would take a good while, but assuming htey're not going anywhere's for a while, would be a good start |
14:02 |
dudly7635 |
they're* |
16:34 |
balrog |
uhh seriously? http://file.wikileaks.org/robots.txt |
16:35 |
balrog |
hit https://web.archive.org/web/20100606104835/http://file.wikileaks.org/file/ti-os-keys-dmca-2009.txt from wikipedia |
16:40 |
xmc |
wikileaks isn't cool any more |
16:40 |
xmc |
they're trying |
16:45 |
schbirid |
wtf |
17:28 |
APerti |
Since when does IA take into consideration robots.txt? |
17:29 |
DFJustin |
since always |
17:46 |
ersi |
Yeah, or well - IA takes it into account when displaying data. Like through the Wayback Machine. When it comes to ingesting/capturing.. heh, well - maybe to some extent. When it comes to storing data? No way. |
17:58 |
xmc |
fuck |
17:58 |
xmc |
it cut off "is offtopic" |
18:44 |
soultcer |
ois soultcer |
18:44 |
soultcer |
whoops |
20:03 |
ersi |
Hey soultcer! :) |
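
On the WARC question from 02:38 to 05:50: since warctozip is said above to fail on a 46GB file without zip64 support, one alternative might be to stream through the WARC and keep only the records whose URL matches. The following is a minimal sketch using only the Python standard library, not a tool anyone in the channel mentioned. It assumes a WARC with plain `WARC/1.0`-style record headers and a `Content-Length` on every record. The function names, the file name, and the URL fragment in the usage comment are hypothetical.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a decompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a record start, got {line[:40]!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def find_responses(stream, url_fragment):
    """Yield (url, body) for response records whose target URI contains url_fragment."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        url = headers.get("warc-target-uri", "")
        if url_fragment in url:
            # Drop the HTTP status line and headers; the body is returned as
            # stored, so it may still be chunked or content-encoded.
            _, _, body = payload.partition(b"\r\n\r\n")
            yield url, body


# Hypothetical usage (file name and URL fragment are placeholders):
# with gzip.open("fanfiction-01.warc.gz", "rb") as stream:
#     for url, body in find_responses(stream, "/s/1234567/"):
#         print(url, len(body))
```

Each record is read and discarded in turn, so memory use stays close to the size of the largest single record rather than the size of the whole file. Python's `gzip` module reads concatenated gzip members transparently, which is how `.warc.gz` files are normally written.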