#archiveteam 2013-11-11,Mon


Time Nickname Message
00:00 🔗 M1das i use wget url --mirror --warc-file=file for this? :p
00:03 🔗 ivan` godane: you sure that's everything? forum has 9075 threads
00:03 🔗 ivan` going by the counts in the column in http://www.atlus.com/forum/
00:04 🔗 Sellyme Of plain text.
00:05 🔗 Sellyme Shouldn't be /too/ large.
00:08 🔗 ivan` godane: you're missing at least some higher-numbered threads like http://www.atlus.com/forum/showthread.php?p=278355
00:08 🔗 Sellyme That's not a thread id
00:08 🔗 Sellyme That's a post id.
00:09 🔗 Sellyme thread id is 9701
00:09 🔗 Sellyme e.g., http://www.atlus.com/forum/showthread.php?t=9701
00:09 🔗 ivan` right, dang, these forum URLs are annoying
00:09 🔗 Sellyme yeah, vBulletin is a heaping pile o' shit.
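Note on the thread id / post id mix-up above: a minimal sketch of telling the two apart from the query string, going only by the two URLs quoted in the log (`?t=` for a thread, `?p=` for a post). The function name `classify_showthread` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def classify_showthread(url):
    """Return ("thread", id), ("post", id) or ("unknown", None) for a showthread.php URL."""
    query = parse_qs(urlsplit(url).query)
    if "t" in query:   # t= carries the thread id
        return ("thread", int(query["t"][0]))
    if "p" in query:   # p= carries a post id, which is not a thread id
        return ("post", int(query["p"][0]))
    return ("unknown", None)
```

Counting distinct `thread` ids, rather than raw showthread.php URLs, is probably the fairer comparison against the 9075 threads the forum index reports.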
00:11 🔗 godane fuck
00:11 🔗 godane i did screw up
00:11 🔗 godane turns out the order=desc is used in forum pages
00:11 🔗 godane will try a 2nd grab
00:12 🔗 M1das so for my info, what's wrong with running wget --mirror --warc-file=blah?
00:13 🔗 ivan` it respects robots.txt and gets blocked from everything for being wget
00:13 🔗 M1das oh
00:13 🔗 ivan` it also doesn't wait between requests, not even a little bit
00:14 🔗 Sellyme "respects robots.txt"
00:14 🔗 Sellyme about that: http://atlus.com/robots.txt
00:14 🔗 Sellyme all's fair.
00:14 🔗 M1das now atlus doesnt have a robots.txt
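Note on "gets blocked from everything for being wget": a rough sketch of how a robots.txt can single out one crawler by user agent, using Python's standard `urllib.robotparser`. The robots.txt text here is hypothetical and is not atlus.com's (which, per the messages above, apparently did not exist); `allowed` is an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks anything identifying as Wget, allows everyone else.
HYPOTHETICAL_ROBOTS = """\
User-agent: Wget
Disallow: /

User-agent: *
Disallow:
"""

def allowed(user_agent, url, robots_text=HYPOTHETICAL_ROBOTS):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler that honours such a file and announces itself as Wget would fetch nothing, while the same request under a browser-like user agent would be permitted.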
00:15 🔗 M1das running the same command again, will it add this to the warc file or will it try to download everything again? got a couple of error 500's
00:16 🔗 godane i'm doing a regrab
00:16 🔗 godane sorry about that
00:16 🔗 godane it grabbed everything but only for the first forum pages
00:17 🔗 M1das still, would like to know :p
00:18 🔗 godane wget $website/forum/index.php --mirror --warc-file=$website-forum-grab-$(date +%Y%m%d) --warc-cdx -E --load-cookies=cookies.txt --accept-regex='(\?t=|\?f=|\.jpg|\.jpeg|\.png|\.gif)' --reject-regex='(highlight|daysprune|sort=|printthread.php|newthread.php|newreply.php|search.php|#|goto=|nojs=)' --warc-max-size=1G -E -o wget.log
00:18 🔗 godane this is before that command website="www.atlus.com"
00:19 🔗 godane the cookies.txt file is exported from my firefox browser
00:19 🔗 godane that way i don't have crappy session urls
00:19 🔗 M1das there is no real default setup for this right? :P
00:20 🔗 godane this setup is meant to grab the info we want
00:21 🔗 godane and its so i can grab it quickly cause i'm on wifi and it could just drop on me
00:22 🔗 M1das ok
00:22 🔗 godane anyways some of reject stuff makes sense
00:23 🔗 godane we don't need sorting or newthread pages
00:23 🔗 M1das true
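Note on the accept/reject patterns in godane's command: a small sketch that applies the same two regexes to sample URLs, to see which ones the filter would keep. It assumes the patterns are searched anywhere in the full URL and models only the pattern logic, not the rest of wget's recursion rules; `would_fetch` is an illustrative name.

```python
import re

# Patterns copied from the wget command in the log (dots left unescaped, as in the original).
ACCEPT = re.compile(r"(\?t=|\?f=|\.jpg|\.jpeg|\.png|\.gif)")
REJECT = re.compile(
    r"(highlight|daysprune|sort=|printthread.php|newthread.php|"
    r"newreply.php|search.php|#|goto=|nojs=)"
)

def would_fetch(url):
    """True if url matches the accept pattern and does not match the reject pattern."""
    return bool(ACCEPT.search(url)) and not REJECT.search(url)
```

Under these assumptions a `showthread.php?p=...` URL is skipped because it matches neither `?t=` nor `?f=`, while a `forumdisplay.php?f=...&order=desc` page passes the filter.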
08:01 🔗 Nemo_bis does anyone here use heritrix? I'd like someone to try fetching a URL for me to see if it interprets a robots.txt correctly https://archive.org/post/1004436/
09:19 🔗 Coderjoe hmm
09:20 🔗 Coderjoe I have this directory of google video stuff that I am wondering if I can clear out. (i think I uploaded it a couple years ago)
09:20 🔗 Coderjoe 9G
14:35 🔗 Nemo_bis what happens if in a tar -cf command I list a file to archive twice in the arguments? is it included twice
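Note: the question went unanswered in channel. One rough way to probe it is the sketch below, which adds the same file twice to an archive and lists the members. It uses Python's `tarfile` module rather than the `tar` command, so it only shows whether the archive format tolerates duplicate member names; what the `tar` CLI does with a repeated argument would need its own check. The function name is made up.

```python
import os
import tarfile
import tempfile

def member_names_after_double_add():
    """Add one file to a tar archive twice and return the stored member names."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w") as fh:
            fh.write("hello\n")
        archive = os.path.join(tmp, "out.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(path, arcname="example.txt")
            tar.add(path, arcname="example.txt")  # same file, second time
        with tarfile.open(archive) as tar:
            return tar.getnames()
```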
14:48 🔗 Nemo_bis GRRRRRR
14:48 🔗 Nemo_bis 500 Internal Server Error
14:48 🔗 Nemo_bis <?xml version='1.0' encoding='UTF-8'?>
14:48 🔗 Nemo_bis <Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource/><RequestId>33f032a1-266c-4367-9aef-2a1fe506988c</RequestId></Error>
14:48 🔗 Nemo_bis Sent 302356241122 bytes (100%)
16:53 🔗 SketchCow Hi.
16:54 🔗 balrog hello SketchCow
16:54 🔗 SketchCow Someone should really be making a project out of sucking out data from forums like we have with wikis.
16:56 🔗 balrog I agree; there are only a handful of really popular forum packages: phpBB, Invision Powerboard, simple machines forum, vBulletin, xenforo, and UBB are the most common I've seen
16:56 🔗 balrog over half of those are proprietary though
16:56 🔗 balrog but if others are interested, let's get started
16:57 🔗 balrog tangential but still important: what's the current status of the yahoo group grabber?
16:57 🔗 balrog let me recall who was working on that
17:00 🔗 balrog ah, it was omf_ and he hasn't been seen for a while. alright
17:02 🔗 Laverne maybe it's possible to hook into that tapatalk addon most forums are adding these days
17:02 🔗 SketchCow Plenty of websites are proprietary and we suck them out anyway.
17:02 🔗 SketchCow It's all HLE stuff.
17:03 🔗 balrog Laverne: good point.
17:03 🔗 balrog I'll take a look
17:04 🔗 balrog it's even documented
17:04 🔗 balrog http://tapatalk.com/api/api_home.php
17:05 🔗 balrog ha! http://www.cpcwiki.eu/forum/mobiquo/mobiquo.php
17:48 🔗 joepie91 balrog: don't forget Vanilla
17:48 🔗 joepie91 though Vanilla is easy
17:48 🔗 joepie91 just append .json to every URL
17:49 🔗 joepie91 (and their document layout is very good and easily parseable anyway)
18:05 🔗 Nemo_bis Unlike wikis (or rather, MediaWiki), however, forums are usually not designed to provide an export of the underlying database data
18:06 🔗 joepie91 Nemo_bis: indeed
18:06 🔗 joepie91 Vanilla is the only one I know of
18:06 🔗 joepie91 http://lowendtalk.com/discussion/16370.json
18:06 🔗 joepie91 (as an example)
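Note on joepie91's ".json" tip: a minimal sketch of turning a Vanilla discussion URL into its JSON form by appending `.json` to the path, as in the lowendtalk example above. Whether every Vanilla install and every URL shape honours this is an assumption taken from the chat; `vanilla_json_url` is an illustrative name.

```python
from urllib.parse import urlsplit, urlunsplit

def vanilla_json_url(url):
    """Append .json to the path of a URL, keeping any query string intact."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```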
18:06 🔗 odie5533 Is there an easy way to upload a file into an Item on Archive.org from the command line?
18:06 🔗 Nemo_bis On the other hand, it's less important to have it because you mostly don't want to edit posts, so a static HTML archive is 80 % fine, where the 20 % left is 1) features like searching in metadata, 2) the possibility to resuscitate the forum
18:06 🔗 joepie91 odie5533: `ia` or ias3upload
18:07 🔗 Nemo_bis odie5533: yes, http://www.archive.org/help/abouts3.txt
18:07 🔗 Nemo_bis and if it's many files, https://github.com/kngenie/ias3upload
18:07 🔗 joepie91 odie5533: https://pypi.python.org/pypi/internetarchive
18:07 🔗 joepie91 might also be useful
18:08 🔗 odie5533 thanks, I think I'll try the ia tool
18:12 🔗 bsmith093 SketchCow: help me settle an argument in the reviews for my superman item. if IA gets a takedown notice for a thing, do they delete the thing or just dark it?
18:13 🔗 odie5533 Is IA internally similar to S3, or only the API?
18:17 🔗 joepie91 bsmith093: afaik they are darked
18:17 🔗 bsmith093 joepie91: I thought so, thanks
18:18 🔗 SmileyG just dark it bsmith093
18:18 🔗 SmileyG everything goes dark
18:19 🔗 SmileyG no law against storing afaik
18:20 🔗 phillipsj Unless you are megaupload.
18:21 🔗 SmileyG lol
18:33 🔗 odie5533 they got in trouble for encouraging illegal usage.
18:33 🔗 odie5533 and for profiting from it
18:34 🔗 balrog meh I just fixed the yahoogroup dumper :)
18:35 🔗 joepie91 the internetarchive python module existing, means that very soon we will have daily dumps of pastebin
18:35 🔗 joepie91 :_)
18:35 🔗 joepie91 :) *
18:37 🔗 balrog https://github.com/balr0g/grabyahoogroup if someone wants to dump a yahoo group :)
18:39 🔗 SketchCow bsmith093: Darked.
18:39 🔗 odie5533 balrog: ooh, perl
18:39 🔗 bsmith093 everyone, joepie91, SketchCow: thanks, i figured that was it
18:39 🔗 balrog odie5533: guess how yahoo broke it?
18:39 🔗 balrog https://github.com/balr0g/grabyahoogroup/commit/351cb8bdc7f1a2c27dafa6757b154622685f15fc
18:40 🔗 odie5533 ('|")?
18:40 🔗 odie5533 it switched from single to double quotes?
18:40 🔗 balrog no, they alternate between single and double quotes
18:40 🔗 odie5533 ah
18:40 🔗 balrog randomly
18:40 🔗 joepie91 lol
18:40 🔗 odie5533 you should be using a character set thing: ['"]
18:41 🔗 balrog ah yeah you're right
18:41 🔗 balrog well it really doesn't matter
18:41 🔗 balrog what I did is the first thing I thought of and it's equivalent
18:42 🔗 odie5533 not for your regex, but if you were doing replacements/extractions it would
18:42 🔗 balrog it would be nice if I could make sure it's either both single or double
18:42 🔗 balrog so it wouldn't pass 'blah"
18:42 🔗 odie5533 huh?
18:42 🔗 balrog right now, it will match 'blah"
18:43 🔗 odie5533 Did you write that whole script?
18:43 🔗 Baljem in that case you want your grouping and a backreference, I think... ('|")otherstuff\1 perhaps? or depending on what the otherstuff is, it might be clearer to just ('otherstuff'|"otherstuff")
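Note on the quoting discussion: a short sketch contrasting the character-class form (`['"]` on both sides, which accepts mismatched quotes) with Baljem's backreference form (the closing quote must equal the opening one). It is written with Python's `re` instead of the script's Perl, and `blah` stands in for whatever sits between the quotes.

```python
import re

# Either quote on each side, chosen independently: mismatched pairs still match.
LOOSE = re.compile(r"""['"]blah['"]""")
# Capture the opening quote and require the same character again via \1.
STRICT = re.compile(r"""(['"])blah\1""")

samples = ["'blah'", '"blah"', "'blah\""]
loose_hits = [bool(LOOSE.fullmatch(s)) for s in samples]
strict_hits = [bool(STRICT.fullmatch(s)) for s in samples]
```

Here `loose_hits` is all True, while `strict_hits` is False only for the mismatched third sample.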
18:43 🔗 balrog no, someone else did; I just fixed it up
18:44 🔗 odie5533 ah
18:44 🔗 balrog since I've become more comfortable with perl and regex since the last time I looked at it (back in january)
18:44 🔗 odie5533 that is one giant ball of perl.
18:44 🔗 balrog lol yep.
18:45 🔗 balrog but it works and I don't feel like rewriting it.
18:45 🔗 odie5533 What does it output?
18:45 🔗 odie5533 A warc file?
18:46 🔗 balrog no
18:46 🔗 balrog separate .txt files for messages in mail message format
18:46 🔗 balrog which it will produce an mbox from
18:47 🔗 balrog directories of files for downloads and attachments, etc
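Note on the ".txt files to mbox" step: a rough sketch of that kind of conversion with Python's standard `mailbox` module. This is not how the Perl script does it. It assumes one RFC 822 style message per `.txt` file in a single directory, and `build_mbox` is a hypothetical helper name.

```python
import mailbox
import os
from email import message_from_string

def build_mbox(message_dir, mbox_path):
    """Append every .txt message in message_dir to an mbox file; return the message count."""
    box = mailbox.mbox(mbox_path)
    try:
        for name in sorted(os.listdir(message_dir)):
            if not name.endswith(".txt"):
                continue  # skip attachments and other non-message files
            full = os.path.join(message_dir, name)
            with open(full, encoding="utf-8", errors="replace") as fh:
                box.add(message_from_string(fh.read()))
        count = len(box)
    finally:
        box.close()  # flushes pending messages to disk
    return count
```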
18:47 🔗 balrog the big annoying part is that to see anything except messages you need login/pass
18:47 🔗 balrog for ANY group
18:47 🔗 balrog and for some groups you need login/pass for anything at all
18:47 🔗 balrog AND, most groups require admin approval
18:48 🔗 balrog if all the admins are gone, you're outta luck
18:48 🔗 odie5533 Yeah, yahoo groups is sorta closed like that.
18:50 🔗 odie5533 IA uses them though.
18:50 🔗 joepie91 I find slight irony in IA using a Yahoo service
18:51 🔗 odie5533 Why? It's not like Yahoo would up and delete one of their products. They've never done anything like that before.
18:52 🔗 * joepie91 tacks a "sarcasm" tag onto there in case anyone didn't catch it yet
18:54 🔗 SmileyG :D
19:06 🔗 odie5533 How can I add things to the ArchiveTeam collection on IA?
19:07 🔗 odie5533 or set media type to web.
21:03 🔗 ersi You can't.
21:03 🔗 ersi SketchCow needs to add you as an admin of the collection before you're able to do that
21:03 🔗 ersi Or just alert him of the items that need moving at some time
21:18 🔗 godane i'm grabbing d-addicts.com forum posts
21:29 🔗 godane i'm using a more brute force grab to make it quick
22:18 🔗 odie5533 ersi: Is that the same for all collections? Should items be sorted, or just uploaded to the Videos/Audios/Texts only?
22:19 🔗 ersi odie5533: You only have permission to the Community * collections by default
22:58 🔗 Nemo_bis mediatype is free though
22:58 🔗 Nemo_bis (but only via s3), so do set it correctly, it makes your items prettier and derivation less stupid
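Note on setting mediatype via S3: a minimal sketch of building the request headers for an upload through the S3-like API linked earlier (abouts3.txt). The header names are written as I understand that document (`x-archive-meta-<field>` for metadata, `x-archive-auto-make-bucket` to create the item, `authorization: LOW key:secret`) and should be checked against it. The keys are placeholders, `ias3_headers` is a made-up helper, and no request is actually sent here.

```python
def ias3_headers(access_key, secret_key, metadata):
    """Build headers for a PUT to the archive.org S3-like endpoint (sketch only)."""
    headers = {
        "authorization": f"LOW {access_key}:{secret_key}",
        "x-archive-auto-make-bucket": "1",  # create the item if it does not exist yet
    }
    for field, value in metadata.items():
        # Each metadata field travels as its own x-archive-meta-* header.
        headers[f"x-archive-meta-{field}"] = str(value)
    return headers

# Placeholder credentials and metadata, for illustration only.
example = ias3_headers("ACCESSKEY", "SECRET", {"mediatype": "web", "title": "forum grab"})
```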
