#archiveteam-bs 2017-09-26,Tue


Who What When
*** kristian_ has quit IRC (Quit: Leaving) [00:27]
.... (idle for 17mn)
dashcloud has quit IRC (Read error: Operation timed out) [00:44]
dashcloud has joined #archiveteam-bs
Soni has quit IRC (Ping timeout: 272 seconds)
Soni has joined #archiveteam-bs
[00:55]
............... (idle for 1h13mn)
balrog has quit IRC (Read error: Operation timed out)
balrog has joined #archiveteam-bs
swebb sets mode: +o balrog
[02:11]
ftpdingus has joined #archiveteam-bs [02:23]
<ftpdingus> Anyone alive? [02:23]
..... (idle for 20mn)
*** ZexaronS- has quit IRC (Read error: Operation timed out) [02:43]
.... (idle for 16mn)
ZexaronS has joined #archiveteam-bs [02:59]
.... (idle for 19mn)
ZexaronS has quit IRC (Read error: Operation timed out) [03:18]
...... (idle for 27mn)
ZexaronS has joined #archiveteam-bs [03:45]
godane has quit IRC (Read error: Operation timed out) [03:55]
....... (idle for 32mn)
<ndiddy> yeah [04:27]
*** ftpdingus has quit IRC (Read error: Connection reset by peer) [04:27]
<ndiddy> oh [04:27]
*** ftpdingus has joined #archiveteam-bs [04:28]
Sk1d has quit IRC (Ping timeout: 250 seconds) [04:37]
<Somebody2> oh? [04:40]
*** Sk1d has joined #archiveteam-bs [04:44]
<astrid> oh. [04:52]
........ (idle for 37mn)
*** divingkat has joined #archiveteam-bs [05:29]
<divingkat> http://miscdatadigs.shoutwiki.com/
The TCRF of regular everyday computer programs
Or something.
Basically digging through programs, apps, etc. to find interesting things in them.
Or at least, uncommon knowledge.
[05:30]
........ (idle for 35mn)
*** BlueMaxim has joined #archiveteam-bs [06:06]
ftpdingus has quit IRC (Ping timeout: 255 seconds) [06:12]
........... (idle for 51mn)
mls has quit IRC (Ping timeout: 250 seconds) [07:03]
mls has joined #archiveteam-bs [07:11]
........... (idle for 53mn)
divingkat has quit IRC (Leaving) [08:04]
.............................. (idle for 2h29mn)
Soni has quit IRC (Ping timeout: 272 seconds)
Soni has joined #archiveteam-bs
[10:33]
........ (idle for 39mn)
<JAA> Whoa. I managed to trigger an infinite loop in wpull. It tried to retrieve the same broken URL (NXDOMAIN) over 6000 times although I specified --tries 3. [11:12]
.......... (idle for 48mn)
<ranma> https://board.byuu.org/viewtopic.php?f=6&t=1804 [12:00]
...... (idle for 26mn)
*** BlueMaxim has quit IRC (Read error: Connection reset by peer) [12:26]
........... (idle for 50mn)
midas_ has joined #archiveteam-bs
zhongfu has quit IRC (Ping timeout: 260 seconds)
zhongfu has joined #archiveteam-bs
[13:16]
.......... (idle for 45mn)
kristian_ has joined #archiveteam-bs [14:04]
....... (idle for 30mn)
brayden_ has joined #archiveteam-bs
swebb sets mode: +o brayden_
brayden has quit IRC (Ping timeout: 255 seconds)
brayden_ is now known as brayden
brayden has left Closing Window
brayden has joined #archiveteam-bs
swebb sets mode: +o brayden
[14:34]
Mateon1 has quit IRC (Read error: Operation timed out) [14:44]
......... (idle for 41mn)
kristian_ has quit IRC (Quit: Leaving)
Stiletto has quit IRC (Read error: Operation timed out)
[15:25]
robink has joined #archiveteam-bs
robinak has quit IRC (Read error: Connection reset by peer)
[15:41]
........ (idle for 39mn)
godane has joined #archiveteam-bs [16:20]
<godane> so i had to find the irc.efnet.org ip address to connect to there
*here
i kept on getting a not resolve error in pidgin
[16:20]
.......... (idle for 49mn)
*** schbirid has joined #archiveteam-bs [17:11]
<JAA> Can confirm, irc.efnet.org does not resolve for me currently. Looks like EFNet's name servers are down. [17:13]
<godane> ok then
at least its not comcast maybe
[17:16]
<JAA> Definitely not, unless the EFNet admins are suicidal and host their name servers at Comcast. [17:17]
..... (idle for 20mn)
So, regarding Dead Format: I'm planning on creating a seed list and letting ArchiveBot take it from there.
Artists, labels, and users are all not enumerable as far as I can see. Releases are, however, although there is also a slug in the URL. So I've grabbed all releases (IDs 1 to 23k) and will extract the artist, label, and user links from those for seeding. This will, however, only cover a tiny part of the website.
My second idea was search engines, but DDG and Searx didn't return many results.
Another plan would be bruteforcing the collection search, which appears to be basically the only way to really discover content on that site.
[17:37]
*** schbirid2 has joined #archiveteam-bs [17:38]
<astrid> <3 [17:39]
<JAA> I found a list of the 10k most common English words, that sounds like a good start. I'll also try to find some lists in other important languages.
If anyone has other ideas, let me know.
(Or such word lists.)
It looks like their search fails when there are too many results. You can't search for "the", for example.
[17:42]
*** schbirid has quit IRC (Read error: Operation timed out) [17:45]
<joepie91> JAA: try wildcards
also
sec
[17:46]
<JAA> Ooh, nice one. "Search for t* in Haves returned 544,902 items"
Unfortunately, this is incredibly slow, i.e. causes very high load on their server.
But it's a very good idea.
[17:47]
<joepie91> JAA: will have more for you in a few mins
hold
[17:52]
<JAA> Huh, the search is funny. Searching "entrée" or "entrèe" both returns the same result, but "entree" doesn't.
Ok
[17:52]
..... (idle for 24mn)
So my releases grab yielded 8355 artists, 3743 labels, and 1624 users. [18:16]
Note for myself: the ArchiveBot job will have to follow external links to grab the full-resolution images. [18:24]
<joepie91> JAA: https://github.com/joepie91/webshots
JAA: in particular https://github.com/joepie91/webshots/blob/master/search.py
note the last commit description
JAA: does wildcard searches that are automatically refined when it detects an incomplete result set, and un-refines afterwards
you'd need to change the exact parameter for determining 'incomplete result set' (you need 'too many results error' instead of 'X amount of results') but otherwise the idea is the same
JAA: tl;dr obtains a complete set with the minimum amount of search requests
[18:26]
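The refinement idea joepie91 describes can be sketched roughly as follows. This is not the code from search.py; it is a minimal illustration under stated assumptions. The `search` callback is hypothetical and is assumed to return a set of item IDs, or `None` when the site signals an incomplete result set (for Dead Format, the "too many results" failure). `make_fake_search` is an in-memory stand-in so the sketch can run without touching any real server.

```python
import string
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical search callback: returns the matching item IDs, or None when
# the site reports an incomplete result set (e.g. "too many results").
SearchFn = Callable[[str], Optional[Set[int]]]


def enumerate_items(
    search: SearchFn,
    alphabet: str = string.ascii_lowercase,
    max_depth: int = 6,
) -> Tuple[Set[int], List[str]]:
    """Collect item IDs via prefix wildcards, refining only where needed."""
    found: Set[int] = set()
    unresolved: List[str] = []   # queries that still overflowed
    pending = list(alphabet)     # start with a*, b*, ..., z*
    while pending:
        prefix = pending.pop()
        hits = search(prefix + "*")
        if hits is not None:
            found |= hits        # complete result set, no refinement needed
            continue
        # Overflow. Longer prefixes would miss the word that equals the
        # prefix itself, so query that word on its own first.
        exact = search(prefix)
        if exact is None:
            unresolved.append(prefix)
        else:
            found |= exact
        if len(prefix) >= max_depth:
            unresolved.append(prefix + "*")
        else:
            pending.extend(prefix + c for c in alphabet)
    return found, unresolved


def make_fake_search(corpus: Dict[int, str], limit: int) -> SearchFn:
    """In-memory stand-in for the real site, for demonstration only."""
    def search(query: str) -> Optional[Set[int]]:
        if query.endswith("*"):
            stem = query[:-1]
            hits = {i for i, title in corpus.items()
                    if any(w.startswith(stem) for w in title.split())}
        else:
            hits = {i for i, title in corpus.items() if query in title.split()}
        return None if len(hits) > limit else hits
    return search
```

Because the pending prefixes sit on a stack, the walk drops back to short prefixes as soon as a refined branch is exhausted, so only the overflowing parts of the search space cost extra requests. A real run would also need rate limiting, given how heavy the wildcard queries are on the server.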
<JAA> Oh, neat.
Thanks! I have to leave now, but I'll take a closer look later.
This sounds perfect.
[18:28]
<joepie91> JAA: it was, incidentally, for an older archiveteam project :) [18:29]
<godane> SketchCow: here are some items are having trouble : https://pastebin.com/W6mkfij5
at least 2 of the items history i have check don't say its dark
*history log
[18:35]
*** bwn has quit IRC (Read error: Operation timed out) [18:36]
......... (idle for 43mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
kristian_ has joined #archiveteam-bs
[19:19]
<JAA> joepie91: How surprising. :-P [19:28]
*** kristian_ has quit IRC (Ping timeout: 370 seconds)
kristian_ has joined #archiveteam-bs
[19:32]
Stilett0- has joined #archiveteam-bs
ruunyan has quit IRC (Ping timeout: 250 seconds)
[19:38]
dashcloud has quit IRC (Read error: Operation timed out) [19:44]
<JAA> So, regarding the entrée/entrèe situation: the search completely ignores the special characters, splits it into two words, and probably ignores "e" as a search word. "entr" obviously matches the entry which contains "entrèe". [19:47]
*** dashcloud has joined #archiveteam-bs [19:48]
<JAA> Interestingly, you can't search for "esp", but "espé" works (but is not the same as "esp*"). [19:49]
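JAA's guess about the entrée/entrèe behaviour can be written down as a small model. This is purely a hypothesis about the site's search, not its actual implementation: the function name, the "ASCII letters and digits only" rule, and the `min_length` cutoff are all assumptions made for illustration.

```python
import re
from typing import List


def guess_search_terms(query: str, min_length: int = 2) -> List[str]:
    """Hypothetical model: split on anything that is not an ASCII letter or
    digit, then drop fragments shorter than min_length (an assumed cutoff)."""
    parts = re.split(r"[^a-z0-9]+", query.lower())
    return [p for p in parts if len(p) >= min_length]
```

Under this model "entrée" and "entrèe" both reduce to the single term "entr", while "entree" stays one whole word, which fits the first observation. It does not explain the "esp" versus "espé" difference, so the real logic must involve something more.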
........ (idle for 37mn)
*** kristian_ has quit IRC (Read error: Operation timed out) [20:26]
kristian_ has joined #archiveteam-bs
kristian_ has quit IRC (Client Quit)
[20:33]
<JAA> Huh, I found a query which returns over one third of all "items" (= collection entries, as far as I can tell). I probably don't have to mention how slow it is.
I don't have the slightest idea why it works though. I entered "*a-b*".
[20:42]
....... (idle for 33mn)
*** schbirid2 has quit IRC (Quit: Leaving)
noirscape has quit IRC (Read error: Connection reset by peer)
[21:16]
........... (idle for 51mn)
<joepie91> lol [22:11]
*** bwn has joined #archiveteam-bs [22:18]
<JAA> My scrape is running now. I'm simply bruteforcing all three-character prefixes, i.e. aaa*, aab*, ... zzz*.
I ended up writing my own code since the specifics were quite a bit different than in your case and I figured I'd be faster this way.
[22:28]
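Generating the query list for that kind of fixed-length bruteforce is short. JAA's own code is not shown in the log, so the following is only a sketch; the function name is made up, and the lowercase a-z alphabet is inferred from the "aaa*, aab*, ... zzz*" description.

```python
import itertools
import string
from typing import Iterator


def three_char_prefix_queries(
    alphabet: str = string.ascii_lowercase, length: int = 3
) -> Iterator[str]:
    """Yield wildcard queries aaa*, aab*, ..., zzz* in lexicographic order."""
    for chars in itertools.product(alphabet, repeat=length):
        yield "".join(chars) + "*"
```

With 26 letters this produces 26^3 = 17,576 queries, a fixed cost regardless of how the content is distributed, in exchange for not needing any overflow detection.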
*** Stilett0- is now known as Stiletto [22:36]
....... (idle for 33mn)
Stiletto has quit IRC (Read error: Operation timed out) [23:09]
.......... (idle for 47mn)
dashcloud has quit IRC (Read error: Operation timed out)
dashcloud has joined #archiveteam-bs
[23:56]
