Time |
Nickname |
Message |
00:23
|
Nemo_bis |
no repo to clone? http://archiveteam.org/index.php?title=AOL_Music |
02:27
|
chfoo |
i added a link to an archive on that aol music page. not sure if the project is finished though... |
02:29
|
chfoo |
i also have a list of 50 items in archive.org that aren't in the archiveteam collection too |
08:52
|
arkiver |
hmm |
08:52
|
arkiver |
wallbase.cc is harder than I thought |
08:52
|
arkiver |
I can quickly download 100 pages in the beginning, but then it slows down |
08:53
|
arkiver |
I think they have some kind of security which makes it slow down... :( |
08:55
|
aggrosk |
Yeah. You'll have to apply some cleverness. The admin has put at least some work into making scraping difficult. I remember when I was trying to grab some wallpapers from a specific search, the server seemed to check the Referer header value to make sure you got to the raw link from the respective page, and wouldn't return the image otherwise. In any case, you'd want to rate limit your script. |
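A minimal sketch of the two ideas in that message: sending the gallery page as the referring page when requesting the raw image, and spacing requests out. The function and class names are hypothetical, the URLs in the usage note are placeholders, and nothing here reflects what wallbase.cc actually checks or what arkiver's script does.

```python
import time
import urllib.request


def build_image_request(image_url, page_url):
    """Build a request for image_url that names page_url as the referring page."""
    # The HTTP header is spelled "Referer" (historical misspelling in the spec).
    return urllib.request.Request(image_url, headers={"Referer": page_url})


class RateLimiter:
    """Allow at most one call every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the limiter can be tested offline
        self.sleep = sleep
        self.last = None        # time of the previous call, None before the first

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Usage would be along the lines of calling `limiter.wait()` before each `urllib.request.urlopen(build_image_request(image_url, page_url))`. An interval of 1.0 second works out to about 60 requests per minute; the right value for a given site is a guess until you see how it responds.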
09:00
|
arkiver |
yeah... |
09:00
|
arkiver |
well |
09:00
|
arkiver |
it is going |
09:01
|
aggrosk |
Is wallbase shutting down? Or is this just a pre-emptive grab? |
09:02
|
arkiver |
well |
09:02
|
arkiver |
their forums have shut down |
09:03
|
arkiver |
they say their website is staying online, but... yeah |
09:03
|
aggrosk |
Which might not bode well for the rest of the service. Huh. |
09:03
|
arkiver |
just in case |
09:03
|
arkiver |
I'm now doing 58 per minute |
09:03
|
arkiver |
links |
09:03
|
aggrosk |
Is there a repo up or a page on the wiki? |
09:03
|
arkiver |
64 per minute |
09:04
|
arkiver |
I don't know... |
09:04
|
arkiver |
you can check for that |
09:04
|
aggrosk |
Don't see anything in either spot, though it does look like there was a recent "panic" grab here: https://archive.org/details/wallpapers.wallbase.cc-rozne-wallpaper-jpg-1-to-100000-20140130 |
09:04
|
arkiver |
yes |
09:04
|
arkiver |
that one was done by godane |
09:05
|
arkiver |
just a small portion of the website |
09:05
|
arkiver |
I'm doing everything |
09:05
|
aggrosk |
Cool. I'll add something to the wiki at least. |
09:05
|
arkiver |
:) |
09:05
|
arkiver |
thank you! |
09:08
|
arkiver |
yipdw: zoom works!!! :D |
09:24
|
aggrosk |
http://archiveteam.org/index.php?title=Wallbase ; consider that a starting point. Y'all can add what you need to it. |
09:31
|
arkiver |
Thank you aggrosk |
09:32
|
aggrosk |
Np. Just updated with some info from the FB and twitter pages. |
09:32
|
aggrosk |
Looks like the owner is MIA. |
09:33
|
arkiver |
average of 59 pages per minute now |
09:34
|
aggrosk |
You ought to upload your code to the wiki at least. Or at least link to it. |
12:41
|
Nemo_bis |
https://archive.org/post/1010731/check-for-hash-md5-or-sha1-to-search-item-or-verify-s3-upload |
13:02
|
joepie91 |
Nemo_bis: please update me on that, if there's any responses |
13:02
|
joepie91 |
:p |
13:02
|
joepie91 |
I kinda need hash search |
13:06
|
arkiver |
would be very helpful indeed |
13:15
|
Nemo_bis |
joepie91: add a comment and you'll be notified by email if there are replies, no? ;) |
13:31
|
joepie91 |
Nemo_bis: oh, no idea, not familiar with how IA forums work |
13:31
|
joepie91 |
:P |
13:34
|
joepie91 |
Nemo_bis: |
13:34
|
joepie91 |
"I'd like to receive email when someone responds to this post" |
13:34
|
joepie91 |
seems to only apply to replies to me |
13:34
|
joepie91 |
not to eht thread |
13:34
|
joepie91 |
the * |
14:13
|
arkiver |
I will be able to do around 40000 links every 12 hours from wallbase.cc |
14:13
|
arkiver |
that's because of their "limit" |
14:15
|
Dud2 |
What size will that archive be in total? and how many links are there in total? |
15:56
|
Nemo_bis |
hashlib in python seems awfully slow |
15:57
|
Nemo_bis |
Not that I'm surprised, but I hoped at most one order of magnitude slower than md5sum. |
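The log does not show how hashlib was being called, so this is only a sketch under an assumption: that the file was hashed from Python by reading it in pieces. Feeding `hashlib` large blocks keeps most of the work inside the C implementation rather than in the Python loop. The function name `file_md5` and the 1 MiB default are illustrative choices, not anything from the conversation.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of the file at path, reading it in chunk_size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is the same hex string `md5sum` prints in its first column, so the two can be compared directly on a test file before trusting either timing.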
17:26
|
Nemo_bis |
:( <Error><Code>SlowDown</Code><Message>Please reduce your request rate.</Message><Resource/><RequestId>5a9f5f99-5658-4959-ab01-ac958cb203eb</RequestId></Error> |
17:31
|
joepie91 |
Nemo_bis: aws? |
17:32
|
Nemo_bis |
joepie91: no, IA |
17:32
|
joepie91 |
ah |
17:32
|
joepie91 |
similar |
17:32
|
joepie91 |
didn't know they had rate limiting though |
17:32
|
joepie91 |
:P |
17:36
|
Nemo_bis |
Not the first time I hit their limit, but it's rare enough. |
17:38
|
Nemo_bis |
Most often, it happens because there are too many waiting tasks on the item in question. |
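One way a script could react to the SlowDown response quoted above is to wait and retry with a growing delay. This is a sketch, not what Nemo_bis's uploader does: `send` is a hypothetical callable that performs one upload attempt and returns the response body as text, and the retry count and delays are arbitrary.

```python
import time


def is_slowdown(body):
    """True if the response body carries the SlowDown error code."""
    return "<Code>SlowDown</Code>" in body


def send_with_backoff(send, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops reporting SlowDown, doubling the wait each time."""
    delay = base_delay
    for attempt in range(max_tries):
        body = send()
        if not is_slowdown(body):
            return body
        if attempt < max_tries - 1:
            # Back off before the next attempt; no point sleeping after the last.
            sleep(delay)
            delay *= 2
    raise RuntimeError("still rate limited after %d tries" % max_tries)
```

If the limit really is tied to waiting tasks on the item, as suggested here, long delays (minutes rather than seconds) are probably the more realistic setting.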
17:39
|
Nemo_bis |
The queue is half of what it was a couple days ago, why complain? :P https://archive.org/~tracey/mrtg/derivesg.html |
18:32
|
Nemo_bis |
joepie91: there's already an answer by the awesome Jeff. :) Though I'm not sure the docs he linked contain an answer to the first question. |
18:32
|
joepie91 |
Nemo_bis: thanks! |
19:31
|
arkiver |
Dud2: I'm not sure how big it will be... Maybe 300GB to 1 TB? |
19:31
|
arkiver |
something around that I think |
19:31
|
arkiver |
but I'll make it |
19:31
|
arkiver |
they say they don't have plans to shut the website down |
19:31
|
arkiver |
so there should be some time |
19:32
|
Dud1 |
Okay, I was going to offer to help, but with that size I don't think I would be able. |
19:42
|
arkiver |
Dud1: ah, thank you anyway! :) |
19:42
|
arkiver |
you are experienced with making crawls? |
19:44
|
Dud1 |
Nope, but willing to learn. |
20:34
|
Nemo_bis |
eek, by checking hash I found 123 wikis where upload had failed. :( |
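The log does not say how the hash check was done. As a rough illustration of the idea, the sketch below compares locally computed MD5s against a remote file listing and reports the names that are missing or differ. The function name and the input shapes are assumptions: `remote_files` is taken to be a list of dicts with `name` and `md5` keys, already fetched and parsed by some other code.

```python
def find_failed_uploads(local_md5s, remote_files):
    """Return the sorted names whose MD5 is missing or different remotely.

    local_md5s: dict mapping file name -> hex MD5 computed locally.
    remote_files: list of dicts with "name" and "md5" keys (assumed shape).
    """
    remote = {entry["name"]: entry.get("md5") for entry in remote_files}
    # A name absent from the remote listing yields None, which never matches.
    return sorted(
        name for name, md5 in local_md5s.items() if remote.get(name) != md5
    )
```

Anything the function returns is a candidate for re-upload; an empty list means every local file has a matching remote hash.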
20:43
|
Nemo_bis |
Ouch, many of those are readonly servers. |