[02:26] folks, is there a git repo to pull the same scripts as the vm image runs for a custom machine?
[02:28] https://github.com/ArchiveTeam/warrior-code2
[02:30] omf_: Thanks
[02:59] https://code.google.com/p/httrack2arc/ via http://forum.httrack.com/readmsg/28483/24652/index.html
[03:10] ivan`, I tried that program out, it failed on multiple httracks I tried. Digging into the code I found the problem in the regex used by LogReader.java
[03:10] They use two really big ones that are far more complicated than necessary
[03:11] ah, too bad
[03:11] And I found their test suite to be laughable
[03:11] https://code.google.com/p/httrack2arc/source/browse/trunk/src/pt/arquivo/httrack2arc/test/model/TestLogEntry.java
[03:12] ivan`, no worries, we have plenty of httrack grabs so a better version of this program will happen
[03:12] Also we just had a ton of projects and not much time for anything else
[03:12] I have about 2000 httracks
[03:13] I have a few hundred gigs including the opensolaris backup.
[03:14] I had found that java converter when looking for a way to use httrack since it does not crash out like wget on some sites and has far more sophisticated configuration options.
[03:19] it should be warc and not arc so much
[03:43] So I was archiving a few Blogspot sites for my own personal use, and I noticed the wget command on the wiki doesn't back up images
[03:43] blogspot images are hosted on a different hostname so you'd have to tweak it a bit
[03:44] http://www.httrack.com/ new release
[03:45] woah
[03:56] this is my function for making a warc without thinking, am I missing anything?
function quick-warc {
    # $1 is the bare hostname; it is used as both the WARC file prefix and the URL to mirror
    wget --warc-file="$1" --warc-cdx --mirror --page-requisites --no-check-certificate -e robots=off "http://$1/"
}
[04:02] Question: has there been any research into actually archiving TV Tropes?
[04:02] I know there's a lone wget command buried on the wiki
[04:03] Empty ChangeLog, NEWS file and one sentence on a website is not a good way to communicate how your software is getting better. I still love httrack though
[11:26] This morning, I'm only getting this on my posterous tracker:
[11:26] Starting GetItemFromTracker for Item
[11:26] No item received. Retrying after 30 seconds...
[11:26] Is everything OK at your end?
[11:26] No item received. Retrying after 30 seconds...
[11:47] samwyse: yeah, there are no more items, unless the things in out get cycled
[11:47] tomorrow there will be a lot of greader items :-)
[11:48] also http://tracker.archiveteam.org/formspring/
[11:48] holy smokes, 64MB/s
[11:56] so i'm grabbing theesa.com site
[11:57] there are only ~2400 files there so i think it's a bit thin on grabs
[11:59] -rw-r--r-- 1 tim.bowers games 16M Apr 19 12:07 ./rotavault.ign.com-2013-04-17.cdx
[11:59] -rw-r--r-- 1 tim.bowers games 7.3G Apr 19 12:07 ./rotavault.ign.com-2013-04-17.warc
[11:59] -rw-r--r-- 1 tim.bowers games 470M May 31 14:00 ./rotavault.ign.com-2013-04-19.cdx
[11:59] -rw-r--r-- 1 tim.bowers games 52G May 31 14:01 ./rotavault.ign.com-2013-04-19.warc
[11:59] Got OOM'ed in the end....
[12:00] Pouet still going :)
[12:01] -rw-r--r-- 1 tim.bowers games 38G Jun 3 13:00 ./bin/ign/storage/pouet/pouet.net_06052013.warc
[12:14] now this is very funny
[12:15] there is a file called PulsePiracy.mpg
[12:15] turns out that it's a g4 segment from the show called Pulse
[19:41] fuck vbox.. should have just leeched newest vmware workstation
[19:42] now i gotta redo my pristine vm since it has fucking vbox drivers inside
[19:43] (oops, wrong channel!)
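The Blogspot image problem raised at 03:43 (images served from a different hostname) could be handled by letting wget span hosts while limiting it to a short list of domains. The following is only a sketch under assumptions, not a tested recipe: the function name `blog_warc_cmd` is hypothetical, `bp.blogspot.com` as the image host suffix is an assumption about where Blogspot serves images, and the function prints the command instead of running it so it can be checked before anything is fetched.

```shell
# Hypothetical helper: print (do not run) a wget command that mirrors a
# Blogspot site into a WARC and also follows page requisites on an
# assumed image host.
blog_warc_cmd() {
    host="$1"
    # Assumption: images live under bp.blogspot.com; pass a second
    # argument to override this if they are hosted elsewhere.
    image_domain="${2:-bp.blogspot.com}"
    printf '%s\n' "wget --warc-file=$host --warc-cdx --mirror --page-requisites --span-hosts --domains=$host,$image_domain -e robots=off http://$host/"
}

# Dry run: show the command for an example blog.
blog_warc_cmd example.blogspot.com
```

Once the printed command looks right it can be pasted into a shell. The `--domains` list is what keeps `--span-hosts` from wandering off to every other site the blog links to.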
[21:52] Hi
[21:52] I have a quick question
[21:52] fire away Shicky256
[21:52] Someone will answer if they can
[21:53] Why does Warrior say that there's no item received?
[21:53] is the tracker down?
[21:53] Shicky256: what project are you running?
[21:53] because there are no more items for posterous right now
[21:54] I tried URLTeam as well, but it said no tasks available
[21:54] http://tracker.archiveteam.org/formspring/ has a lot of items
[21:55] yeah Formspring is the only active project atm
[21:55] then why is posterous recommended instead of that?
[21:55] URLTeam will return once it's swapped over to the new guys running it
[21:55] Shicky256: because alard isn't around to change it atm
[21:55] Cool
[21:55] And I don't have access to the tracker to see how it's done :D
[21:55] who else but alard can set the warrior priority?
[21:55] my guesses are ersi..... and that's it
[21:56] I don't know who else has tracker access from commandline.
[21:56] underscor
[21:56] What happened to the whole Formspring thing anyway? didn't it close over a month ago?
[21:56] alard: ping when you're around.
[21:56] hmm
[21:56] Shicky256: they got someone to buy it apparently, however as you might well guess, that can mean *anything*
[21:56] I don't know which redis key it is
[21:56] so we're still grabbing it, just in case :)
[21:56] If someone knows, I have shell on the box
[21:56] let's hope it isn't yahoo
[21:57] Shicky256: hahaha I said that ;)
[21:58] Seriously, yahoo closes everything. I give tumblr a year.
[21:59] Poor tumblr
[21:59] well, gotta go.
[22:02] Marcelo: they will combine flickr and tumblr into some kind of mega product offering
[22:02] I give it ..... 2 years
[22:02] then eventually it'll all close
[22:03] underscor: maybe warriorhq:projects_json, not sure if it's there, http://warriorhq.archiveteam.org/ is not responding for me
[22:04] we really need to document how to do things like this D:
[22:04] what?
[22:05] It'd be epic if we could turn on xanga again, and add new users to it as we go along.
[22:05] underscor: see also warrior-hq/set-projects-json.rb
[22:36] the rotavault warcs are going up now :D