[09:59] Is anyone active currently?
[10:10] define "active" ;)
[10:11] Well, helllo :)
[10:11] I'm awake and working, but not doing anything archive-related.
[10:11] I see, I was just wondering what would be the best way to make a superficial archive of an entire forum.
[10:12] I'm not into the technical details of the whole archiving operation, but I'm pretty sure there are people here who know how to do that.
[10:13] http://archiveteam.org/index.php?title=Software
[10:15] Do you know if these scrapers pick up where they leave off? E.g., if I close the app for the day and come back, will it continue?
[10:17] Ahh, read HTTrack's Wikipedia page and it seems it can, brilliant. Will read up more, thanks :)
[10:17] Most do. ;)
[10:26] wget won't.
[10:26] I wouldn't say "most do", I'd say "test it"
[10:27] Has anyone used SiteSucker for OS X?
[10:27] People here mostly use wget and/or HTTrack
[10:28] I've used it once or twice for small sites, but was wondering if it would be feasible for larger sites. Primarily using OS X for this.
[10:30] Mmm, maybe I'll get a Windows user to help with the scraping then.
[10:31] Sounds horrible. OS X and Windows for archiving. :)
[10:31] Unless you archive straight into WARC output without touching the filthy filesystems of OS X or Windows
[10:32] I'd wholeheartedly recommend using something that can produce WARC output (i.e. save to the WARC format), since then it'll be of use to the Internet Archive (and plenty of other archival organisations)
[10:37] I'll look into it. I'm guessing it's some kind of container format for archives?
[10:39] btw might be afk for a bit soonish
[10:53] I think HTTrack has support. I know wget *has* support
[11:02] You could always use WarcProxy and run HTTrack through it; it's a bit roundabout, but it should work
[11:14] Sum1: WARC is a format specifically for web archiving
[11:14] it retains headers, error pages, and all the other metadata that you'd throw away when saving to disk
[11:15] alard: awake?
[11:57] joepie93: he's been missing for a while
[12:00] :/
[12:00] ersi: really need someone to write a pipeline for hyves
[12:01] I haven't seen him around for *months* though
[12:01] I wrote a user discovery script.. tbh I don't really have time for this project, but I get the idea that if I don't do it, it won't get done at all
[12:01] but I really can't afford, time-wise, to also write a pipeline
[12:01] That's the general gist of all projects
[12:01] "Do it or it won't happen"
[12:01] problem is that I'm too busy to run an entire project
[12:02] some is better than none
[12:02] yes, but user discovery alone is not going to help
[12:02] sure it is, that's one less thing to do
[12:02] keyword: "alone"
[13:58] Cameron_D: So you know, I wrote a MITM alternative to WarcProxy that supports SSL. Two, actually.
[13:58] oh neat, link?
[13:59] https://github.com/odie5533/WarcMITMProxy and https://github.com/odie5533/WarcTwistedMITMProxy
[13:59] I do lots of scraping with Majestic-12, so I've been considering throwing that into the middle and then uploading the WARCs to IA
[13:59] The former is probably more stable and complete at this point, but I'm now just working on the latter one.
[14:01] because I definitely think that going forward, having a stable, scalable WARC proxy is very important, since it removes the need to keep rewriting WARC handling in various programs.
[14:03] Cameron_D: If you're looking for alternative scrapers, give Scrapy a try. I've already got some groundwork completed in it: https://github.com/odie5533/WarcMiddleware
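A rough sketch of the wget WARC support and the HTTrack-through-a-proxy idea discussed above. The flags themselves are real (GNU wget 1.14+ grew the --warc-* options; HTTrack's -P sets a proxy), but the URL, filename, and proxy address here are placeholders, not anything from the channel:

    # Recursive grab that also writes forum-grab.warc.gz (plus a CDX index)
    # alongside the normal on-disk files. forum.example.com and forum-grab
    # are placeholders.
    wget -r -l inf --page-requisites --wait=1 \
         --warc-file=forum-grab --warc-cdx \
         "http://forum.example.com/"

    # If HTTrack's own WARC support doesn't pan out, the roundabout route
    # suggested above is to run it through a WARC-writing proxy. The
    # proxy address/port is assumed here, not WarcProxy's actual default.
    httrack "http://forum.example.com/" -P localhost:8000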
[14:04] great, I'll keep a watch on those
[15:07] I vaguely recall reading that you can submit WARC files to the Internet Archive's Wayback Machine; does anyone know if this is the case?
[15:08] ask SketchCow or undersco2
[15:38] What
[15:38] You can upload stuff to the Internet Archive and alert us to it.
[15:38] We have to look at it.
[15:40] huh, I hadn't thought of using HTTrack together with a WARC proxy
[15:41] that could be extremely useful
[15:47] Sum1: http://archive.org/upload/
[16:54] godane: We already have an HPR collection someone is maintaining.
[16:54] Other than that, I've been putting your items into collections.
[17:07] SketchCow: it looks like not all of them were there
[17:08] that was my only reason for doing it
[17:10] SketchCow: the collection mostly has 10 MP3s in an item
[17:10] and it stopped at 620 for some reason
[17:11] then someone uploaded hpr1282 and put hpr1284 into that item too
[17:11] it's just crazy, the way it's being done
[17:15] SketchCow: also know that the geekbeattvreviews go into the computerandtechvideos collection
[17:16] the way the collection is now, it looks like it's going to be under texts
[17:21] SketchCow: there are also geekbeat.tv episodes in community videos
[17:21] i'm up to about episode 702 now on that one
[17:27] kind of a dumb question here: are WARCs that are harvested uploaded to the Internet Archive to become part of the general web collection... available through Wayback?
[17:28] Yes.
[17:28] nice
[17:28] do the WARCs separately go up there as files?
[17:28] where they can be viewed like other uploaded files?
[17:30] i'm giving a talk about web preservation in New Zealand and want to really highlight the awesome work Archive Team does http://www.ndf.org.nz/
[17:30] so i need to get my facts straight :-D
[17:31] Ah.
[17:31] OK, so.
[17:31] The way the Wayback Machine works on the Internet Archive is that it looks at the web collection, in which there are WARC files.
[17:31] There are indexers that figure out what URLs are backed up, and what item has that information, in what file.
[17:32] Ever since the Internet Archive started doing its own crawling (not just taking from Alexa Internet), it has done it this way.
[17:32] What Archive Team did was make it possible for outsiders/"just folks" to provide things to this collection.
[17:32] So the upshot is that we can add items to the web collection, but it has to be done by an admin.
[17:33] It can't just happen, and this is why my life is filled with so many requests from darling AT members to make something web.
[17:33] We have ways to take the item, go "where is this from?", and find the object.
[17:33] Each time something is read, that item gets a read.
[17:34] This is why these web objects will say "downloaded XXX times" - XXX is the number of times people used Wayback.
[17:38] got it, thanks SketchCow
[17:38] No problem.
[18:06] https://archive.org/details/archiveteam_archivebot_go_002 still kicking ass
[21:19] Well, in one way - it would be kind of stupid if we as outsiders could add material to the Wayback. Who knows what we put into our WARCs. :)
[21:28] right, the custody issue
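Besides the upload form linked above, the Internet Archive's internetarchive command-line tool can do the same from a shell. A sketch, assuming the `ia` tool is installed and configured with your IA keys; the item identifier, filename, and metadata values are placeholders:

    # One-time setup: store the S3-style API keys for your IA account.
    ia configure

    # Create an item and upload the WARC into it.
    ia upload example-forum-grab-2013 forum-grab.warc.gz \
        --metadata="title:Example forum grab" \
        --metadata="mediatype:web"

As the discussion above makes clear, uploading only gets the WARC into an item; an admin still has to move it into the web collection before the Wayback Machine's indexers will see it.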
[21:41] If I create a list of URLs, can the warrior grab them?
[21:46] Not unless you write a project for the warrior.
[21:48] Does the warrior work as a good dev environment for creating warrior projects?
[22:09] I just booted it up for the first time, and it doesn't seem to be a good dev environment, heh
[22:34] SketchCow: I never knew that the number of times a web grab is downloaded is the number of times it's been viewed in the Wayback
[22:54] Yes