#archiveteam 2014-09-23,Tue

Time	Nickname	Message
01:10 ^🔗	primus	I would like to use archivebot to archive a site (http://retrospec.sgn.net/users/tomcat/yu/index.php), the problem with it is that in the links through site sometimes a pc.sux.org address is used and it redirects back to content on original site (namely link to magazines does that).
01:11 ^🔗	primus	Could someone help me with it since I've never used archivebot before?
01:12 ^🔗	primus	Dang, I edited the first msg so many times it's even worse than my usual english ;-)
02:38 ^🔗	Cybele	Hi, I'm trying to get my stories from https://archive.org/details/archiveteam-fanfiction-warc-01 I know they are in there according to the tar file
02:39 ^🔗	Cybele	I downloaded the 46GB warc but attempting to browse its contents has proved frustrating so far
02:39 ^🔗	Cybele	Can the other files be used to extract my stories from the warc?
02:40 ^🔗	RKenshin	[03:35] * Cybele (Mibbit@host86-138-31-130.range86-138.btcentralplus.com) Quit (http://www.mibbit.com ajax IRC Client)
02:40 ^🔗	RKenshin	[04:01] <@DFJustin> http://warctozip.archive.org/
02:40 ^🔗	RKenshin	[04:01] <@DFJustin> https://github.com/ArchiveTeam/warctozip
02:40 ^🔗	RKenshin	[04:01] <@DFJustin> warctozip
02:40 ^🔗	willwill	Try the wayback machine?
02:40 ^🔗	Cybele	Fanfiction.net is blocked thanks to robots.txt
05:50 ^🔗	DFJustin	warctozip won't work on 46gb files unless you hack it to support zip64
05:51 ^🔗	DFJustin	primus: redirects like that should be ok since archivebot follows external links one level deep
08:34 ^🔗	midas	it might work with a warcproxy?
09:01 ^🔗	danneh_	hmm: http://www.infinidb.co/forum/important-announcement-infinidb
09:27 ^🔗	catbuster	Has anyone tried using PhantomJS to archive websites? I'm only looking at archiving individual URLs
09:32 ^🔗	midas	archivebot uses it on request
09:34 ^🔗	Rotab	wouldnt that make them into images? or is that just a side-feature of phantom? :P
10:29 ^🔗	catbuster	Rotab: You can also use it to create WARC files or simply just download all the resources using a proxy. Archive.today uses it for their crawling.
10:30 ^🔗	catbuster	midas: I'm looking for something I can use to archive a single URL, rather than an entire website, which is what I believe archivebot does.
10:31 ^🔗	midas	well, wget can do that
10:32 ^🔗	catbuster	But wget can't archive dynamic websites with a lot of javascript
10:33 ^🔗	catbuster	What does archivebot do when using PhantomJS? Using WarcMITMproxy?
10:33 ^🔗	Rotab	catbuster: ah :)
10:38 ^🔗	midas	catbuster: ask yipdw
10:39 ^🔗	catbuster	midas: Alright.
13:14 ^🔗	phuzion	What project should I be putting my resources towards? Something that I can throw lots of threads at?
14:00 ^🔗	dudly7635	Hey guys I was just thinking, with the way things have been going, would it be a good idea to just start archiving all torrent sites so a repeat of isohunt doesn't happen
14:02 ^🔗	dudly7635	Just figured I would put that out there, we could put them into the archivebot now and just let it run, of course would take a good while, but assuming htey're not going anywhere's for a while, would be a good start
14:02 ^🔗	dudly7635	they're*
16:34 ^🔗	balrog	uhh seriously? http://file.wikileaks.org/robots.txt
16:35 ^🔗	balrog	hit https://web.archive.org/web/20100606104835/http://file.wikileaks.org/file/ti-os-keys-dmca-2009.txt from wikipedia
16:40 ^🔗	xmc	wikileaks isn't cool any more
16:40 ^🔗	xmc	they're trying
16:45 ^🔗	schbirid	wtf
17:28 ^🔗	APerti	Since when does IA take into consideration robots.txt?
17:29 ^🔗	DFJustin	since always
17:46 ^🔗	ersi	Yeah, or well - IA takes it into account when displaying data. Like through the Wayback Machine. When it comes to ingesting/capturing.. heh, well - maybe to some extent. When it comes to storing data? No way.
17:58 ^🔗	xmc	fuck
17:58 ^🔗	xmc	it cut off "is offtopic"
18:44 ^🔗	soultcer	ois soultcer
18:44 ^🔗	soultcer	whoops
20:03 ^🔗	ersi	Hey soultcer! :)
