Time |
Nickname |
Message |
00:00
🔗
|
M1das |
i use wget url --mirror --warc-file=file for this? :p |
00:03
🔗
|
ivan` |
godane: you sure that's everything? forum has 9075 threads |
00:03
🔗
|
ivan` |
going by the counts in the column in http://www.atlus.com/forum/ |
00:04
🔗
|
Sellyme |
Of plain text. |
00:05
🔗
|
Sellyme |
Shouldn't be /too/ large. |
00:08
🔗
|
ivan` |
godane: you're missing at least some higher-numbered threads like http://www.atlus.com/forum/showthread.php?p=278355 |
00:08
🔗
|
Sellyme |
That's not a thread id |
00:08
🔗
|
Sellyme |
That's a post id. |
00:09
🔗
|
Sellyme |
thread id is 9701 |
00:09
🔗
|
Sellyme |
e.g., http://www.atlus.com/forum/showthread.php?t=9701 |
00:09
🔗
|
ivan` |
right, dang, these forum URLs are annoying |
00:09
🔗
|
Sellyme |
yeah, vBulletin is a heaping pile o' shit. |
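The mix-up above comes from `showthread.php` accepting either a thread id (`t=`) or a post id (`p=`). A minimal sketch of telling the two apart, assuming only the two query parameters seen in the URLs quoted in this log (the function name is made up for illustration):

```python
from urllib.parse import urlparse, parse_qs


def classify_showthread_url(url):
    """Return ("thread", id), ("post", id) or ("unknown", None)."""
    query = parse_qs(urlparse(url).query)
    if "t" in query:
        # t= carries a thread id, e.g. showthread.php?t=9701
        return ("thread", int(query["t"][0]))
    if "p" in query:
        # p= carries a post id, which says nothing about how many threads exist
        return ("post", int(query["p"][0]))
    return ("unknown", None)
```

Comparing thread counts against a grab only makes sense with the `t=` values; `p=` values count posts and run much higher.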
00:11
🔗
|
godane |
fuck |
00:11
🔗
|
godane |
i did screw up |
00:11
🔗
|
godane |
turns out the order=desc is used in forum pages |
00:11
🔗
|
godane |
will try a 2nd grab |
00:12
🔗
|
M1das |
so for my info, what's wrong with running wget --mirror --warc-file=blah? |
00:13
🔗
|
ivan` |
it respects robots.txt and gets blocked from everything for being wget |
00:13
🔗
|
M1das |
oh |
00:13
🔗
|
ivan` |
it also doesn't wait between requests, not even a little bit |
00:14
🔗
|
Sellyme |
"respects robots.txt" |
00:14
🔗
|
Sellyme |
about that: http://atlus.com/robots.txt |
00:14
🔗
|
Sellyme |
all's fair. |
00:14
🔗
|
M1das |
now atlus doesnt have a robots.txt |
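The two objections to a bare `wget --mirror` (robots.txt handling and no delay between requests) map onto existing wget options: `-e robots=off`, `--wait` and `--random-wait`. A hedged sketch that only builds the argument list; the helper name and the default one-second delay are assumptions, and whether ignoring robots.txt is appropriate depends on the site:

```python
def polite_wget_args(url, warc_name, wait_seconds=1):
    """Build (but do not run) a wget command line for a slower mirror."""
    return [
        "wget", url,
        "--mirror",
        f"--warc-file={warc_name}",
        f"--wait={wait_seconds}",   # pause between retrievals
        "--random-wait",            # vary the pause around --wait
        "-e", "robots=off",         # do not apply robots.txt exclusions
    ]
```

The list can be handed to `subprocess.run(...)` when an actual grab is wanted.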
00:15
🔗
|
M1das |
running the same command again, will it add this to the warc file or will it try to download everything again? got a couple of error 500's |
00:16
🔗
|
godane |
i'm doing a regrab |
00:16
🔗
|
godane |
sorry about that |
00:16
🔗
|
godane |
it grabbed everything, but only for the first forum pages |
00:17
🔗
|
M1das |
still, would like to know :p |
00:18
🔗
|
godane |
wget $website/forum/index.php --mirror --warc-file=$website-forum-grab-$(date +%Y%m%d) --warc-cdx -E --load-cookies=cookies.txt --accept-regex='(\?t=|\?f=|\.jpg|\.jpeg|\.png|\.gif)' --reject-regex='(highlight|daysprune|sort=|printthread.php|newthread.php|newreply.php|search.php|#|goto=|nojs=)' --warc-max-size=1G -E -o wget.log |
00:18
🔗
|
godane |
this is before that command website="www.atlus.com" |
00:19
🔗
|
godane |
the cookies.txt file is exported from my firefox browser |
00:19
🔗
|
godane |
that way i don't have crappy session urls |
00:19
🔗
|
M1das |
there is no real default setup for this right? :P |
00:20
🔗
|
godane |
this setup is meant to grab the info we want |
00:21
🔗
|
godane |
and its so i can grab it quickly cause i'm on wifi and it could just drop on me |
00:22
🔗
|
M1das |
ok |
00:22
🔗
|
godane |
anyways some of the reject stuff makes sense |
00:23
🔗
|
godane |
we don't need sorting or newthread pages |
00:23
🔗
|
M1das |
true |
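To see which URLs godane's `--accept-regex` / `--reject-regex` pair lets through, the two patterns can be tried in Python. This is only an approximation: wget applies its own regex engine to the complete URL, so edge cases may differ, and the `forumdisplay.php` URL in the usage note is an illustrative guess, not taken from the log.

```python
import re

# Patterns copied from the wget command quoted above.
ACCEPT = re.compile(r"(\?t=|\?f=|\.jpg|\.jpeg|\.png|\.gif)")
REJECT = re.compile(
    r"(highlight|daysprune|sort=|printthread.php|newthread.php|"
    r"newreply.php|search.php|#|goto=|nojs=)"
)


def would_fetch(url):
    """True if the URL matches the accept pattern and not the reject pattern."""
    return bool(ACCEPT.search(url)) and not REJECT.search(url)
```

Under these patterns `showthread.php?t=9701` passes, while `showthread.php?p=278355` and something like `forumdisplay.php?f=5&sort=title` do not.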
08:01
🔗
|
Nemo_bis |
does anyone here use heritrix? I'd like someone to try fetching a URL for me to see if it interprets a robots.txt correctly https://archive.org/post/1004436/ |
09:19
🔗
|
Coderjoe |
hmm |
09:20
🔗
|
Coderjoe |
I have this directory of google video stuff that I am wondering if I can clear out. (i think I uploaded it a couple years ago) |
09:20
🔗
|
Coderjoe |
9G |
14:35
🔗
|
Nemo_bis |
what happens if in a tar -cf command I list a file to archive twice in the arguments? is it included twice |
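The question goes unanswered in the log. One way to check is with Python's `tarfile` module, which stores a second member when the same file is added twice. This sketch demonstrates `tarfile` only; it does not verify the `tar -cf` command line itself, and the file names are placeholders.

```python
import os
import tarfile
import tempfile


def member_names_when_listed_twice():
    """Add one file to a tar archive twice and return the member names."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "w") as handle:
            handle.write("hello\n")
        archive = os.path.join(tmp, "out.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(path, arcname="example.txt")
            tar.add(path, arcname="example.txt")  # same file again
        with tarfile.open(archive) as tar:
            return tar.getnames()
```

The returned list contains the name twice, so the archive holds two entries; on extraction the later one simply overwrites the earlier.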
14:48
🔗
|
Nemo_bis |
GRRRRRR |
14:48
🔗
|
Nemo_bis |
500 Internal Server Error |
14:48
🔗
|
Nemo_bis |
<?xml version='1.0' encoding='UTF-8'?> |
14:48
🔗
|
Nemo_bis |
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><Resource/><RequestId>33f032a1-266c-4367-9aef-2a1fe506988c</RequestId></Error> |
14:48
🔗
|
Nemo_bis |
Sent 302356241122 bytes (100%) |
16:53
🔗
|
SketchCow |
Hi. |
16:54
🔗
|
balrog |
hello SketchCow |
16:54
🔗
|
SketchCow |
Someone should really be making a project out of sucking out data from forums like we have with wikis. |
16:56
🔗
|
balrog |
I agree; there are only a handful of really popular forum packages: phpBB, Invision Powerboard, simple machines forum, vBulletin, xenforo, and UBB are the most common I've seen |
16:56
🔗
|
balrog |
over half of those are proprietary though |
16:56
🔗
|
balrog |
but if others are interested, let's get started |
16:57
🔗
|
balrog |
tangential but still important: what's the current status of the yahoo group grabber? |
16:57
🔗
|
balrog |
let me recall who was working on that |
17:00
🔗
|
balrog |
ah, it was omf_ and he hasn't been seen for a while. alright |
17:02
🔗
|
Laverne |
maybe it's possible to hook into that tapatalk addon most forums are adding these days |
17:02
🔗
|
SketchCow |
Plenty of websites are proprietary and we suck them out anyway. |
17:02
🔗
|
SketchCow |
It's all HLE stuff. |
17:03
🔗
|
balrog |
Laverne: good point. |
17:03
🔗
|
balrog |
I'll take a look |
17:04
🔗
|
balrog |
it's even documented |
17:04
🔗
|
balrog |
http://tapatalk.com/api/api_home.php |
17:05
🔗
|
balrog |
ha! http://www.cpcwiki.eu/forum/mobiquo/mobiquo.php |
17:48
🔗
|
joepie91 |
balrog: don't forget Vanilla |
17:48
🔗
|
joepie91 |
though Vanilla is easy |
17:48
🔗
|
joepie91 |
just append .json to every URL |
17:49
🔗
|
joepie91 |
(and their document layout is very good and easily parseable anyway) |
18:05
🔗
|
Nemo_bis |
Unlike wikis (or rather, MediaWiki), however, forums are usually not designed to provide an export of the underlying database data |
18:06
🔗
|
joepie91 |
Nemo_bis: indeed |
18:06
🔗
|
joepie91 |
Vanilla is the only one I know of |
18:06
🔗
|
joepie91 |
http://lowendtalk.com/discussion/16370.json |
18:06
🔗
|
joepie91 |
(as an example) |
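joepie91's "append .json" rule for Vanilla forums, as a small sketch. The helper name is hypothetical, and it assumes the `.json` suffix belongs on the path ahead of any query string:

```python
from urllib.parse import urlsplit, urlunsplit


def vanilla_json_url(url):
    """Return the .json variant of a Vanilla forum URL."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )
```

For example, `vanilla_json_url("http://lowendtalk.com/discussion/16370")` gives the URL joepie91 pasted.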
18:06
🔗
|
odie5533 |
Is there an easy way to upload a file into an Item on Archive.org from the command line? |
18:06
🔗
|
Nemo_bis |
On the other hand, it's less important to have it because you mostly don't want to edit posts, so a static HTML archive is 80 % fine, where the 20 % left is 1) features like searching in metadata, 2) the possibility to resuscitate the forum |
18:06
🔗
|
joepie91 |
odie5533: `ia` or ias3upload |
18:07
🔗
|
Nemo_bis |
odie5533: yes, http://www.archive.org/help/abouts3.txt |
18:07
🔗
|
Nemo_bis |
and if it's many files, https://github.com/kngenie/ias3upload |
18:07
🔗
|
joepie91 |
odie5533: https://pypi.python.org/pypi/internetarchive |
18:07
🔗
|
joepie91 |
might also be useful |
18:08
🔗
|
odie5533 |
thanks, I think I'll try the ia tool |
18:12
🔗
|
bsmith093 |
SketchCow: help me settle an argument in the reviews for my superman item. if IA gets a takedown notice for a thing, do they delete the thing or just dark it? |
18:13
🔗
|
odie5533 |
Is IA internally similar to S3, or only the API? |
18:17
🔗
|
joepie91 |
bsmith093: afaik they are darked |
18:17
🔗
|
bsmith093 |
joepie91: I thought so, thanks |
18:18
🔗
|
SmileyG |
just dark it bsmith093 |
18:18
🔗
|
SmileyG |
everything goes dark |
18:19
🔗
|
SmileyG |
no law against storing afaik |
18:20
🔗
|
phillipsj |
Unless you are megaupload. |
18:21
🔗
|
SmileyG |
lol |
18:33
🔗
|
odie5533 |
they got in trouble for encouraging illegal usage. |
18:33
🔗
|
odie5533 |
and for profiting from it |
18:34
🔗
|
balrog |
meh I just fixed the yahoogroup dumper :) |
18:35
🔗
|
joepie91 |
the internetarchive python module existing, means that very soon we will have daily dumps of pastebin |
18:35
🔗
|
joepie91 |
:_) |
18:35
🔗
|
joepie91 |
:) * |
18:37
🔗
|
balrog |
https://github.com/balr0g/grabyahoogroup if someone wants to dump a yahoo group :) |
18:39
🔗
|
SketchCow |
bsmith093: Darked. |
18:39
🔗
|
odie5533 |
balrog: ooh, perl |
18:39
🔗
|
bsmith093 |
everyone, joepie91, SketchCow: thanks, i figured that was it |
18:39
🔗
|
balrog |
odie5533: guess how yahoo broke it? |
18:39
🔗
|
balrog |
https://github.com/balr0g/grabyahoogroup/commit/351cb8bdc7f1a2c27dafa6757b154622685f15fc |
18:40
🔗
|
odie5533 |
('|")? |
18:40
🔗
|
odie5533 |
it switched from single to double quotes? |
18:40
🔗
|
balrog |
no, they alternate between single and double quotes |
18:40
🔗
|
odie5533 |
ah |
18:40
🔗
|
balrog |
randomly |
18:40
🔗
|
joepie91 |
lol |
18:40
🔗
|
odie5533 |
you should be using a character set thing: ['"] |
18:41
🔗
|
balrog |
ah yeah you're right |
18:41
🔗
|
balrog |
well it really doesn't matter |
18:41
🔗
|
balrog |
what I did is the first thing I thought of and it's equivalent |
18:42
🔗
|
odie5533 |
not for your regex, but if you were doing replacements/extractions it would |
18:42
🔗
|
balrog |
it would be nice if I could make sure it's either both single or double |
18:42
🔗
|
balrog |
so it wouldn't pass 'blah" |
18:42
🔗
|
odie5533 |
huh? |
18:42
🔗
|
balrog |
right now, it will match 'blah" |
18:43
🔗
|
odie5533 |
Did you write that whole script? |
18:43
🔗
|
Baljem |
in that case you want your grouping and a backreference, I think... ('|")otherstuff\1 perhaps? or depending on what the otherstuff is, it might be clearer to just ('otherstuff'|"otherstuff") |
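Baljem's backreference suggestion, shown in Python rather than the script's Perl. `\1` must repeat whatever quote character the first group captured, so mismatched quotes are rejected. The pattern here is a generic illustration, not the one from the yahoo group grabber:

```python
import re

# (['"]) captures the opening quote; \1 requires the same character to close.
MATCHED_QUOTES = re.compile(r"""(['"])(.*?)\1""")


def quoted_value(text):
    """Return the text between matching quotes, or None if there is none."""
    match = MATCHED_QUOTES.search(text)
    return match.group(2) if match else None
```

With this, `'blah'` and `"blah"` both yield `blah`, while `'blah"` yields `None`, which is the behaviour balrog asked for.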
18:43
🔗
|
balrog |
no, someone else did; I just fixed it up |
18:44
🔗
|
odie5533 |
ah |
18:44
🔗
|
balrog |
since I've become more comfortable with perl and regex since the last time I looked at it (back in january) |
18:44
🔗
|
odie5533 |
that is one giant ball of perl. |
18:44
🔗
|
balrog |
lol yep. |
18:45
🔗
|
balrog |
but it works and I don't feel like rewriting it. |
18:45
🔗
|
odie5533 |
What does it output? |
18:45
🔗
|
odie5533 |
A warc file? |
18:46
🔗
|
balrog |
no |
18:46
🔗
|
balrog |
separate .txt files for messages in mail message format |
18:46
🔗
|
balrog |
which it will produce an mbox from |
18:47
🔗
|
balrog |
directories of files for downloads and attachments, etc |
18:47
🔗
|
balrog |
the big annoying part is that to see anything except messages you need login/pass |
18:47
🔗
|
balrog |
for ANY group |
18:47
🔗
|
balrog |
and for some groups you need login/pass for anything at all |
18:47
🔗
|
balrog |
AND, most groups require admin approval |
18:48
🔗
|
balrog |
if all the admins are gone, you're outta luck |
18:48
🔗
|
odie5533 |
Yeah, yahoo groups is sorta closed like that. |
18:50
🔗
|
odie5533 |
IA uses them though. |
18:50
🔗
|
joepie91 |
I find slight irony in IA using a Yahoo service |
18:51
🔗
|
odie5533 |
Why? It's not like Yahoo would up and delete one of their products. They've never done anything like that before. |
18:52
🔗
|
* |
joepie91 tacks a "sarcasm" tag onto there in case anyone didn't catch it yet |
18:54
🔗
|
SmileyG |
:D |
19:06
🔗
|
odie5533 |
How can I add things to the ArchiveTeam collection on IA? |
19:07
🔗
|
odie5533 |
or set media type to web. |
21:03
🔗
|
ersi |
You can't. |
21:03
🔗
|
ersi |
SketchCow needs to add you as an admin of the collection before you're able to do that |
21:03
🔗
|
ersi |
Or just alert him of the items that need moving at some time |
21:18
🔗
|
godane |
i'm grabbing d-addicts.com forum posts |
21:29
🔗
|
godane |
i'm using a more brute force grab to make it quick |
22:18
🔗
|
odie5533 |
ersi: Is that the same for all collections? Should items be sorted, or just uploaded to the Videos/Audios/Texts only? |
22:19
🔗
|
ersi |
odie5533: You only have permission to the Community * collections by default |
22:58
🔗
|
Nemo_bis |
mediatype is free though |
22:58
🔗
|
Nemo_bis |
(but only via s3), so do set it correctly, it makes your items prettier and derivation less stupid |
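Setting the mediatype over the S3-like API comes down to one extra request header. A hedged sketch that only assembles the URL and headers for a PUT, following the conventions in the abouts3.txt document linked earlier; the helper name is made up, and the identifier, file name and keys in any call are placeholders:

```python
def ias3_upload_request(identifier, filename, access_key, secret_key,
                        mediatype="web"):
    """Return (url, headers) for an archive.org S3-style PUT; sends nothing."""
    url = f"https://s3.us.archive.org/{identifier}/{filename}"
    headers = {
        "authorization": f"LOW {access_key}:{secret_key}",
        "x-archive-auto-make-bucket": "1",       # create the item if missing
        "x-archive-meta-mediatype": mediatype,   # item metadata: mediatype
    }
    return url, headers
```

The result can be passed to any HTTP client together with the file body. The collection cannot be changed this way without the permissions ersi describes.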