#archiveteam 2014-09-23,Tue

Time Nickname Message
01:10 🔗 primus I would like to use archivebot to archive a site (http://retrospec.sgn.net/users/tomcat/yu/index.php). The problem is that links within the site sometimes use a pc.sux.org address, which redirects back to content on the original site (namely, the link to magazines does that).
01:11 🔗 primus Could someone help me with it since I've never used archivebot before?
01:12 🔗 primus Dang, I edited the first msg so many times it's even worse than my usual english ;-)
02:38 🔗 Cybele Hi, I'm trying to get my stories from https://archive.org/details/archiveteam-fanfiction-warc-01 I know they are in there according to the tar file
02:39 🔗 Cybele I downloaded the 46GB warc but attempting to browse its contents has proved frustrating so far
02:39 🔗 Cybele Can the other files be used to extract my stories from the warc?
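[Editor's note] One way to pull specific pages out of a large WARC without browsing it is to stream through the records and keep only responses whose URL matches. The sketch below uses only the Python standard library and is not the tool anyone in the channel mentions. It assumes the file is gzip-compressed (.warc.gz), that each record fits in memory, and that the names iter_warc_records and find_responses are purely illustrative. The returned body still contains the HTTP response headers ahead of the page itself.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line[:40]!r}")
        # Read the WARC header block, which ends at the first empty line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def find_responses(path: str, url_fragment: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) for response records whose URL contains url_fragment.

    Assumes a gzip-compressed WARC; for a plain .warc, open() the file in
    binary mode and pass it to iter_warc_records() directly.
    """
    with gzip.open(path, "rb") as stream:
        for headers, body in iter_warc_records(stream):
            uri = headers.get("warc-target-uri", "")
            if headers.get("warc-type") == "response" and url_fragment in uri:
                yield uri, body
```

Because the file is read sequentially, a 46GB archive takes a while to scan, but it never has to be unpacked or loaded whole.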
02:40 🔗 RKenshin [03:35] * Cybele (Mibbit@host86-138-31-130.range86-138.btcentralplus.com) Quit (http://www.mibbit.com ajax IRC Client)
02:40 🔗 RKenshin [04:01] <@DFJustin> http://warctozip.archive.org/
02:40 🔗 RKenshin [04:01] <@DFJustin> https://github.com/ArchiveTeam/warctozip
02:40 🔗 RKenshin [04:01] <@DFJustin> warctozip
02:40 🔗 willwill Try the wayback machine?
02:40 🔗 Cybele Fanfiction.net is blocked thanks to robots.txt
05:50 🔗 DFJustin warctozip won't work on 46gb files unless you hack it to support zip64
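[Editor's note] For context on the zip64 point: the classic ZIP format caps archives and members at 4 GiB, and the ZIP64 extensions lift that limit. The sketch below shows how Python's standard zipfile module exposes this through its allowZip64 parameter. It is not warctozip's actual code, and whether warctozip builds its output this way is an assumption the log does not confirm. The helper name write_zip64 is hypothetical.

```python
import zipfile
from typing import Iterable, Tuple


def write_zip64(zip_path: str, entries: Iterable[Tuple[str, bytes]]) -> int:
    """Write (name, data) pairs to a ZIP archive with ZIP64 extensions allowed.

    Returns the number of entries written.
    """
    count = 0
    # allowZip64=True lets the archive grow past the 4 GiB limit of plain ZIP.
    # It is the default since Python 3.4; Python 2 defaulted to False.
    with zipfile.ZipFile(
        zip_path, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
    ) as archive:
        for name, data in entries:
            archive.writestr(name, data)
            count += 1
    return count
```

Passing the entries as a generator keeps only one member in memory at a time, which matters when the source is a multi-gigabyte WARC.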
05:51 🔗 DFJustin primus: redirects like that should be ok since archivebot follows external links one level deep
08:34 🔗 midas it might work with a warcproxy?
09:01 🔗 danneh_ hmm: http://www.infinidb.co/forum/important-announcement-infinidb
09:27 🔗 catbuster Has anyone tried using PhantomJS to archive websites? I'm only looking at archiving individual URLs
09:32 🔗 midas archivebot uses it on request
09:34 🔗 Rotab wouldn't that make them into images? or is that just a side-feature of phantom? :P
10:29 🔗 catbuster Rotab: You can also use it to create WARC files or simply just download all the resources using a proxy. Archive.today uses it for their crawling.
10:30 🔗 catbuster midas: I'm looking for something I can use to archive a single URL, rather than an entire website, which is what I believe archivebot does.
10:31 🔗 midas well, wget can do that
10:32 🔗 catbuster But wget can't archive dynamic websites with a lot of javascript
10:33 🔗 catbuster What does archivebot do when using PhantomJS? Using WarcMITMproxy?
10:33 🔗 Rotab catbuster: ah :)
10:38 🔗 midas catbuster: ask yipdw
10:39 🔗 catbuster midas: Alright.
13:14 🔗 phuzion What project should I be putting my resources towards? Something that I can throw lots of threads at?
14:00 🔗 dudly7635 Hey guys I was just thinking, with the way things have been going, would it be a good idea to just start archiving all torrent sites so a repeat of isohunt doesn't happen
14:02 🔗 dudly7635 Just figured I would put that out there, we could put them into the archivebot now and just let it run, of course would take a good while, but assuming htey're not going anywhere's for a while, would be a good start
14:02 🔗 dudly7635 they're*
16:34 🔗 balrog uhh seriously? http://file.wikileaks.org/robots.txt
16:35 🔗 balrog hit https://web.archive.org/web/20100606104835/http://file.wikileaks.org/file/ti-os-keys-dmca-2009.txt from wikipedia
16:40 🔗 xmc wikileaks isn't cool any more
16:40 🔗 xmc they're trying
16:45 🔗 schbirid wtf
17:28 🔗 APerti Since when does IA take into consideration robots.txt?
17:29 🔗 DFJustin since always
17:46 🔗 ersi Yeah, or well - IA takes it into account when displaying data. Like through the Wayback Machine. When it comes to ingesting/capturing.. heh, well - maybe to some extent. When it comes to storing data? No way.
17:58 🔗 xmc fuck
17:58 🔗 xmc it cut off "is offtopic"
18:44 🔗 soultcer ois soultcer
18:44 🔗 soultcer whoops
20:03 🔗 ersi Hey soultcer! :)
