[00:00] i use wget url --mirror --warc-file=file for this? :p
[00:03] godane: you sure that's everything? forum has 9075 threads
[00:03] going by the counts in the column in http://www.atlus.com/forum/
[00:04] Of plain text.
[00:05] Shouldn't be /too/ large.
[00:08] godane: you're missing at least some higher-numbered threads like http://www.atlus.com/forum/showthread.php?p=278355
[00:08] That's not a thread id
[00:08] That's a post id.
[00:09] thread id is 9701
[00:09] e.g., http://www.atlus.com/forum/showthread.php?t=9701
[00:09] right, dang, these forum URLs are annoying
[00:09] yeah, vBulletin is a heaping pile o' shit.
[00:11] fuck
[00:11] i did screw up
[00:11] turns out the order=desc is used in forum pages
[00:11] will try a 2nd grab
[00:12] so for my info, what's wrong with running wget --mirror --warc-file=blah?
[00:13] it respects robots.txt and gets blocked from everything for being wget
[00:13] oh
[00:13] it also doesn't wait between requests, not even a little bit
[00:14] "respects robots.txt"
[00:14] about that: http://atlus.com/robots.txt
[00:14] all's fair.
[00:14] now atlus doesn't have a robots.txt
[00:15] running the same command again, will it add this to the warc file or will it try to download everything again? got a couple of error 500s
[00:16] i'm doing a regrab
[00:16] sorry about that
[00:16] it grabbed everything but only from the first forum pages
[00:17] still, would like to know :p
[00:18] wget $website/forum/index.php --mirror --warc-file=$website-forum-grab-$(date +%Y%m%d) --warc-cdx -E --load-cookies=cookies.txt --accept-regex='(\?t=|\?f=|\.jpg|\.jpeg|\.png|\.gif)' --reject-regex='(highlight|daysprune|sort=|printthread.php|newthread.php|newreply.php|search.php|#|goto=|nojs=)' --warc-max-size=1G -E -o wget.log
[00:18] this is before that command: website="www.atlus.com"
[00:19] the cookies.txt file is exported from my firefox browser
[00:19] that way i don't have crappy session urls
[00:19] there is no real default setup for this, right? :P
[00:20] this setup is meant to grab the info we want
[00:21] and it's so i can grab it quickly, cause i'm on wifi and it could just drop on me
[00:22] ok
[00:22] anyways, some of the reject stuff makes sense
[00:23] we don't need sorting or newthread pages
[00:23] true
[08:01] does anyone here use heritrix? I'd like someone to try fetching a URL for me, to see if it interprets a robots.txt correctly https://archive.org/post/1004436/
[09:19] hmm
[09:20] I have this directory of google video stuff that I am wondering if I can clear out. (i think I uploaded it a couple years ago)
[09:20] 9G
[14:35] what happens if in a tar -cf command I list a file to archive twice in the arguments? is it included twice?
[14:48] GRRRRRR
[14:48] 500 Internal Server Error
[14:48] InternalError: We encountered an internal error. Please try again. (request id 33f032a1-266c-4367-9aef-2a1fe506988c)
[14:48] Sent 302356241122 bytes (100%)
[16:53] Hi.
[16:54] hello SketchCow
[16:54] Someone should really be making a project out of sucking data out of forums, like we have with wikis.
[16:56] I agree; there are only a handful of really popular forum packages: phpBB, Invision Power Board, Simple Machines Forum, vBulletin, XenForo, and UBB are the most common I've seen
[16:56] over half of those are proprietary though
[16:56] but if others are interested, let's get started
[16:57] tangential but still important: what's the current status of the yahoo group grabber?
[16:57] let me recall who was working on that
[17:00] ah, it was omf_ and he hasn't been seen for a while.
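
A cleaned-up sketch of godane's command from [00:18], for anyone who wants to reuse it. The -e robots=off and --wait flags are additions here, addressing the two wget defaults criticized at [00:13]; the duplicated -E is dropped and the dots in the reject regex are escaped. Assumes wget 1.14 or later, the first release with WARC and --accept-regex support:

    website="www.atlus.com"    # placeholder; set to the forum's host
    wget "http://$website/forum/index.php" --mirror \
        -e robots=off --wait=1 --random-wait \
        --warc-file="$website-forum-grab-$(date +%Y%m%d)" \
        --warc-cdx --warc-max-size=1G \
        -E --load-cookies=cookies.txt \
        --accept-regex='(\?t=|\?f=|\.jpg|\.jpeg|\.png|\.gif)' \
        --reject-regex='(highlight|daysprune|sort=|printthread\.php|newthread\.php|newreply\.php|search\.php|#|goto=|nojs=)' \
        -o wget.log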
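
As for the [14:35] tar question: at least with GNU tar, a file named twice on the command line is stored twice, which is easy to confirm:

    echo hello > note.txt
    tar -cf twice.tar note.txt note.txt   # same file named twice
    tar -tf twice.tar                     # lists note.txt twice; on extraction
                                          # the second copy overwrites the first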
alright
[17:02] maybe it's possible to hook into that tapatalk addon most forums are adding these days
[17:02] Plenty of websites are proprietary and we suck them out anyway.
[17:02] It's all HLE stuff.
[17:03] Laverne: good point.
[17:03] I'll take a look
[17:04] it's even documented
[17:04] http://tapatalk.com/api/api_home.php
[17:05] ha! http://www.cpcwiki.eu/forum/mobiquo/mobiquo.php
[17:48] balrog: don't forget Vanilla
[17:48] though Vanilla is easy
[17:48] just append .json to every URL
[17:49] (and their document layout is very good and easily parseable anyway)
[18:05] Unlike wikis (or rather, MediaWiki), however, forums are usually not designed to provide an export of the underlying database data
[18:06] Nemo_bis: indeed
[18:06] Vanilla is the only one I know of
[18:06] http://lowendtalk.com/discussion/16370.json
[18:06] (as an example)
[18:06] Is there an easy way to upload a file into an item on archive.org from the command line?
[18:06] On the other hand, it's less important to have it because you mostly don't want to edit posts, so a static HTML archive is 80% fine, where the 20% left is 1) features like searching in metadata, 2) the possibility to resuscitate the forum
[18:06] odie5533: `ia` or ias3upload
[18:07] odie5533: yes, http://www.archive.org/help/abouts3.txt
[18:07] and if it's many files, https://github.com/kngenie/ias3upload
[18:07] odie5533: https://pypi.python.org/pypi/internetarchive
[18:07] might also be useful
[18:08] thanks, I think I'll try the ia tool
[18:12] SketchCow: help me settle an argument in the reviews for my superman item. if IA gets a takedown notice for a thing, do they delete the thing or just dark it?
[18:13] Is IA internally similar to S3, or only the API?
[18:17] bsmith093: afaik they are darked
[18:17] joepie91: I thought so, thanks
[18:18] just dark it bsmith093
[18:18] everything goes dark
[18:19] no law against storing afaik
[18:20] Unless you are megaupload.
[18:21] lol
[18:33] they got in trouble for encouraging illegal usage.
[18:33] and for profiting from it
[18:34] meh I just fixed the yahoogroup dumper :)
[18:35] the internetarchive python module existing means that very soon we will have daily dumps of pastebin
[18:35] :)
[18:37] https://github.com/balr0g/grabyahoogroup if someone wants to dump a yahoo group :)
[18:39] bsmith093: Darked.
[18:39] balrog: ooh, perl
[18:39] everyone joepie91 SketchCow thanks, i figured that was it
[18:39] odie5533: guess how yahoo broke it?
[18:39] https://github.com/balr0g/grabyahoogroup/commit/351cb8bdc7f1a2c27dafa6757b154622685f15fc
[18:40] ('|")?
[18:40] it switched from single to double quotes?
[18:40] no, they alternate between single and double quotes
[18:40] ah
[18:40] randomly
[18:40] lol
[18:40] you should be using a character set thing: ['"]
[18:41] ah yeah you're right
[18:41] well it really doesn't matter
[18:41] what I did is the first thing I thought of and it's equivalent
[18:42] not for your regex, but if you were doing replacements/extractions it would
[18:42] it would be nice if I could make sure it's either both single or double
[18:42] so it wouldn't pass 'blah"
[18:42] huh?
[18:42] right now, it will match 'blah"
[18:43] Did you write that whole script?
[18:43] in that case you want your grouping and a backreference, I think... ('|")otherstuff\1 perhaps?
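
The Mobiquo endpoint found at [17:05] speaks XML-RPC over HTTP POST; a quick probe, assuming the forum's plugin exposes the get_config method the Tapatalk docs describe:

    curl -s -H 'Content-Type: text/xml' \
        --data '<?xml version="1.0"?><methodCall><methodName>get_config</methodName></methodCall>' \
        http://www.cpcwiki.eu/forum/mobiquo/mobiquo.php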
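
The Vanilla .json trick is easy to try against the example thread linked at [18:06]; python -m json.tool is just one convenient pretty-printer:

    curl -s http://lowendtalk.com/discussion/16370.json | python -m json.tool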
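
And a minimal command-line upload with the internetarchive package odie5533 settled on (recent versions of the `ia` tool; my-test-item is a hypothetical identifier):

    pip install internetarchive
    ia configure                          # prompts for archive.org credentials
    ia upload my-test-item forum.warc.gz --metadata="mediatype:web"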
or depending on what the otherstuff is, it might be clearer to just ('otherstuff'|"otherstuff")
[18:43] no, someone else did; I just fixed it up
[18:44] ah
[18:44] since I've become more comfortable with perl and regex since the last time I looked at it (back in january)
[18:44] that is one giant ball of perl.
[18:44] lol yep.
[18:45] but it works and I don't feel like rewriting it.
[18:45] What does it output?
[18:45] A warc file?
[18:46] no
[18:46] separate .txt files for messages in mail message format
[18:46] which it will produce an mbox from
[18:47] directories of files for downloads and attachments, etc
[18:47] the big annoying part is that to see anything except messages you need login/pass
[18:47] for ANY group
[18:47] and for some groups you need login/pass for anything at all
[18:47] AND, most groups require admin approval
[18:48] if all the admins are gone, you're outta luck
[18:48] Yeah, yahoo groups is sorta closed like that.
[18:50] IA uses them though.
[18:50] I find slight irony in IA using a Yahoo service
[18:51] Why? It's not like Yahoo would up and delete one of their products. They've never done anything like that before.
[18:52] * joepie91 tacks a "sarcasm" tag onto there in case anyone didn't catch it yet
[18:54] :D
[19:06] How can I add things to the ArchiveTeam collection on IA?
[19:07] or set media type to web.
[21:03] You can't.
[21:03] SketchCow needs to add you as an admin of the collection before you're able to do that
[21:03] Or just alert him of the items that need moving at some time
[21:18] i'm grabbing d-addicts.com forum posts
[21:29] i'm using a more brute-force grab to make it quick
[22:18] ersi: Is that the same for all collections? Should items be sorted, or just uploaded to the Videos/Audios/Texts only?
[22:19] odie5533: You only have permission to the Community * collections by default
[22:58] mediatype is free though
[22:58] (but only via s3), so do set it correctly; it makes your items prettier and derivation less stupid
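
To make the quoting exchange from [18:40]-[18:43] concrete ('foo' stands in for the otherstuff in the pattern): the character class accepts a mismatched pair, while the capture-plus-backreference version does not. \x27 is a single quote, written that way so the Perl pattern can sit inside shell single quotes:

    printf '%s\n' "'foo'" '"foo"' "'foo\"" > samples.txt
    grep -E "['\"]foo['\"]" samples.txt               # matches all three lines
    perl -ne 'print if /(["\x27])foo\1/' samples.txt  # skips the mismatched 'foo"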
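
Since mediatype can only be set through the S3 interface, a sketch of setting it at upload time, following the abouts3.txt documentation linked earlier (ACCESS/SECRET are your archive.org S3 keys; my-test-item is again a hypothetical identifier):

    curl --location \
        --header "authorization: LOW $ACCESS:$SECRET" \
        --header 'x-archive-auto-make-bucket:1' \
        --header 'x-archive-meta-mediatype:web' \
        --upload-file forum.warc.gz \
        http://s3.us.archive.org/my-test-item/forum.warc.gz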