Time |
Nickname |
Message |
00:27
π
|
|
kristian_ has quit IRC (Quit: Leaving) |
00:44
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
00:55
π
|
|
dashcloud has joined #archiveteam-bs |
00:56
π
|
|
Soni has quit IRC (Ping timeout: 272 seconds) |
00:58
π
|
|
Soni has joined #archiveteam-bs |
02:11
π
|
|
balrog has quit IRC (Read error: Operation timed out) |
02:13
π
|
|
balrog has joined #archiveteam-bs |
02:13
π
|
|
swebb sets mode: +o balrog |
02:23
π
|
|
ftpdingus has joined #archiveteam-bs |
02:23
π
|
ftpdingus |
Anyone alive? |
02:43
π
|
|
ZexaronS- has quit IRC (Read error: Operation timed out) |
02:59
π
|
|
ZexaronS has joined #archiveteam-bs |
03:18
π
|
|
ZexaronS has quit IRC (Read error: Operation timed out) |
03:45
π
|
|
ZexaronS has joined #archiveteam-bs |
03:55
π
|
|
godane has quit IRC (Read error: Operation timed out) |
04:27
π
|
ndiddy |
yeah |
04:27
π
|
|
ftpdingus has quit IRC (Read error: Connection reset by peer) |
04:27
π
|
ndiddy |
oh |
04:28
π
|
|
ftpdingus has joined #archiveteam-bs |
04:37
π
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:40
π
|
Somebody2 |
oh? |
04:44
π
|
|
Sk1d has joined #archiveteam-bs |
04:52
π
|
astrid |
oh. |
05:29
π
|
|
divingkat has joined #archiveteam-bs |
05:30
π
|
divingkat |
http://miscdatadigs.shoutwiki.com/ |
05:30
π
|
divingkat |
The TCRF of regular everyday computer programs |
05:30
π
|
divingkat |
Or something. |
05:30
π
|
divingkat |
Basically digging through programs, apps, etc. to find interesting things in them. |
05:31
π
|
divingkat |
Or at least, uncommon knowledge. |
06:06
π
|
|
BlueMaxim has joined #archiveteam-bs |
06:12
π
|
|
ftpdingus has quit IRC (Ping timeout: 255 seconds) |
07:03
π
|
|
mls has quit IRC (Ping timeout: 250 seconds) |
07:11
π
|
|
mls has joined #archiveteam-bs |
08:04
π
|
|
divingkat has quit IRC (Leaving) |
10:33
π
|
|
Soni has quit IRC (Ping timeout: 272 seconds) |
10:33
π
|
|
Soni has joined #archiveteam-bs |
11:12
π
|
JAA |
Whoa. I managed to trigger an infinite loop in wpull. It tried to retrieve the same broken URL (NXDOMAIN) over 6000 times although I specified --tries 3. |
12:00
π
|
ranma |
https://board.byuu.org/viewtopic.php?f=6&t=1804 |
12:26
π
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
13:16
π
|
|
midas_ has joined #archiveteam-bs |
13:19
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
13:19
π
|
|
zhongfu has joined #archiveteam-bs |
14:04
π
|
|
kristian_ has joined #archiveteam-bs |
14:34
π
|
|
brayden_ has joined #archiveteam-bs |
14:34
π
|
|
swebb sets mode: +o brayden_ |
14:37
π
|
|
brayden has quit IRC (Ping timeout: 255 seconds) |
14:37
π
|
|
brayden_ is now known as brayden |
14:38
π
|
|
brayden has left Closing Window |
14:38
π
|
|
brayden has joined #archiveteam-bs |
14:38
π
|
|
swebb sets mode: +o brayden |
14:44
π
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
15:25
π
|
|
kristian_ has quit IRC (Quit: Leaving) |
15:27
π
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
15:41
π
|
|
robink has joined #archiveteam-bs |
15:41
π
|
|
robinak has quit IRC (Read error: Connection reset by peer) |
16:20
π
|
|
godane has joined #archiveteam-bs |
16:20
π
|
godane |
so i had to find the irc.efnet.org ip address to connect to there |
16:21
π
|
godane |
*here |
16:22
π
|
godane |
i kept on getting a not resolve error in pidgin |
17:11
π
|
|
schbirid has joined #archiveteam-bs |
17:13
π
|
JAA |
Can confirm, irc.efnet.org does not resolve for me currently. Looks like EFNet's name servers are down. |
17:16
π
|
godane |
ok then |
17:16
π
|
godane |
at least it's not comcast maybe |
17:17
π
|
JAA |
Definitely not, unless the EFNet admins are suicidal and host their name servers at Comcast. |
17:37
π
|
JAA |
So, regarding Dead Format: I'm planning on creating a seed list and letting ArchiveBot take it from there. |
17:37
π
|
JAA |
Artists, labels, and users are all not enumerable as far as I can see. Releases are, however, although there is also a slug in the URL. So I've grabbed all releases (IDs 1 to 23k) and will extract the artist, label, and user links from those for seeding. This will, however, only cover a tiny part of the website. |
17:37
π
|
JAA |
My second idea was search engines, but DDG and Searx didn't return many results. |
17:38
π
|
JAA |
Another plan would be bruteforcing the collection search, which appears to be basically the only way to really discover content on that site. |
17:38
π
|
|
schbirid2 has joined #archiveteam-bs |
17:39
π
|
astrid |
<3 |
17:42
π
|
JAA |
I found a list of the 10k most common English words, that sounds like a good start. I'll also try to find some lists in other important languages. |
17:43
π
|
JAA |
If anyone has other ideas, let me know. |
17:43
π
|
JAA |
(Or such word lists.) |
17:44
π
|
JAA |
It looks like their search fails when there are too many results. You can't search for "the", for example. |
17:45
π
|
|
schbirid has quit IRC (Read error: Operation timed out) |
17:46
π
|
joepie91 |
JAA: try wildcards |
17:46
π
|
joepie91 |
also |
17:47
π
|
joepie91 |
sec |
17:47
π
|
JAA |
Ooh, nice one. "Search for t* in Haves returned 544,902 items" |
17:48
π
|
JAA |
Unfortunately, this is incredibly slow, i.e. causes very high load on their server. |
17:48
π
|
JAA |
But it's a very good idea. |
17:52
π
|
joepie91 |
JAA: will have more for you in a few mins |
17:52
π
|
joepie91 |
hold |
17:52
π
|
JAA |
Huh, the search is funny. Searching for "entrée" and "entrèe" returns the same result, but "entree" doesn't. |
17:52
π
|
JAA |
Ok |
18:16
π
|
JAA |
So my releases grab yielded 8355 artists, 3743 labels, and 1624 users. |
18:24
π
|
JAA |
Note for myself: the ArchiveBot job will have to follow external links to grab the full-resolution images. |
18:26
π
|
joepie91 |
JAA: https://github.com/joepie91/webshots |
18:26
π
|
joepie91 |
JAA: in particular https://github.com/joepie91/webshots/blob/master/search.py |
18:26
π
|
joepie91 |
note the last commit description |
18:26
π
|
joepie91 |
JAA: does wildcard searches that are automatically refined when it detects an incomplete result set, and un-refines afterwards |
18:27
π
|
joepie91 |
you'd need to change the exact parameter for determining 'incomplete result set' (you need 'too many results error' instead of 'X amount of results') but otherwise the idea is the same |
18:28
π
|
joepie91 |
JAA: tl;dr obtains a complete set with the minimum amount of search requests |
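The refinement idea joepie91 describes can be sketched roughly as follows. This is not the code from the linked repository; it is a minimal illustration under stated assumptions. `search` is a hypothetical callable that returns a list of results, or `None` when the site answers with a "too many results" error, and the a-z alphabet and `max_depth` cut-off are likewise assumptions.

```python
import string
from typing import Callable, Iterator, List, Optional, Tuple

ALPHABET = string.ascii_lowercase  # assumption: only a-z prefixes are tried


def enumerate_prefixes(
    search: Callable[[str], Optional[List[str]]],
    prefix: str = "",
    max_depth: int = 4,
) -> Iterator[Tuple[str, Optional[List[str]]]]:
    """Yield (query, results) for the shortest wildcard queries that succeed.

    `search` is hypothetical: it returns a list of results, or None when
    the result set is incomplete (here: a "too many results" error).
    """
    for ch in ALPHABET:
        candidate = prefix + ch
        query = candidate + "*"
        results = search(query)
        if results is not None:
            # Complete result set: no need to refine this branch further.
            yield query, results
        elif len(candidate) < max_depth:
            # Incomplete: refine by one more character, then fall back
            # ("un-refine") to the next sibling prefix afterwards.
            yield from enumerate_prefixes(search, candidate, max_depth)
        else:
            # Still too many results at max_depth: report it as unresolved.
            yield query, None
```

Only prefixes that overflow are split further, so sparse parts of the keyspace cost a single request each. Terms that do not start with a character from `ALPHABET` are never reached by this sketch.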
18:28
π
|
JAA |
Oh, neat. |
18:28
π
|
JAA |
Thanks! I have to leave now, but I'll take a closer look later. |
18:28
π
|
JAA |
This sounds perfect. |
18:29
π
|
joepie91 |
JAA: it was, incidentally, for an older archiveteam project :) |
18:35
π
|
godane |
SketchCow: here are some items that are having trouble: https://pastebin.com/W6mkfij5 |
18:35
π
|
godane |
at least 2 of the items' history i have checked don't say it's dark |
18:35
π
|
godane |
*history log |
18:36
π
|
|
bwn has quit IRC (Read error: Operation timed out) |
19:19
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
19:20
π
|
|
dashcloud has joined #archiveteam-bs |
19:24
π
|
|
kristian_ has joined #archiveteam-bs |
19:28
π
|
JAA |
joepie91: How surprising. :-P |
19:32
π
|
|
kristian_ has quit IRC (Ping timeout: 370 seconds) |
19:32
π
|
|
kristian_ has joined #archiveteam-bs |
19:38
π
|
|
Stilett0- has joined #archiveteam-bs |
19:38
π
|
|
ruunyan has quit IRC (Ping timeout: 250 seconds) |
19:44
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
19:47
π
|
JAA |
So, regarding the entrée/entrèe situation: the search completely ignores the special characters, splits it into two words, and probably ignores "e" as a search word. "entr" obviously matches the entry which contains "entrèe". |
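JAA's guess about the search can be modelled in a few lines. This is only a sketch of that hypothesis, not the site's real behaviour: the rule that anything outside a-z and 0-9 acts as a separator, and the minimum term length of 2, are assumptions chosen to reproduce the observation.

```python
import re
import unicodedata

MIN_TERM_LENGTH = 2  # assumption: one-letter terms such as "e" are dropped


def guess_search_terms(query: str) -> list:
    """Split a query the way the search is hypothesised to split it."""
    # Use precomposed characters so an accented letter is a single code point.
    normalized = unicodedata.normalize("NFC", query).lower()
    # Hypothesis: every character outside a-z and 0-9 is a word separator.
    parts = re.split(r"[^a-z0-9]+", normalized)
    return [p for p in parts if len(p) >= MIN_TERM_LENGTH]


for q in ("entr\u00e9e", "entr\u00e8e", "entree"):
    print(q, "->", guess_search_terms(q))
```

Under these assumptions both accented spellings reduce to the single term `entr`, while the unaccented spelling stays `entree`, which would explain why the two groups return different results.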
19:48
π
|
|
dashcloud has joined #archiveteam-bs |
19:49
π
|
JAA |
Interestingly, you can't search for "esp", but "espé" works (but is not the same as "esp*"). |
20:26
π
|
|
kristian_ has quit IRC (Read error: Operation timed out) |
20:33
π
|
|
kristian_ has joined #archiveteam-bs |
20:36
π
|
|
kristian_ has quit IRC (Client Quit) |
20:42
π
|
JAA |
Huh, I found a query which returns over one third of all "items" (= collection entries, as far as I can tell). I probably don't have to mention how slow it is. |
20:43
π
|
JAA |
I don't have the slightest idea why it works though. I entered "*a-b*". |
21:16
π
|
|
schbirid2 has quit IRC (Quit: Leaving) |
21:20
π
|
|
noirscape has quit IRC (Read error: Connection reset by peer) |
22:11
π
|
joepie91 |
lol |
22:18
π
|
|
bwn has joined #archiveteam-bs |
22:28
π
|
JAA |
My scrape is running now. I'm simply bruteforcing all three-character prefixes, i.e. aaa*, aab*, ... zzz*. |
22:29
π
|
JAA |
I ended up writing my own code since the specifics were quite a bit different than in your case and I figured I'd be faster this way. |
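The prefix enumeration JAA mentions (aaa*, aab*, ... zzz*) can be generated like this. JAA's own code is not shown in the log, so this is only an illustrative sketch; the lowercase a-z alphabet is an assumption taken from the example prefixes.

```python
import string
from itertools import product


def three_char_prefixes(alphabet: str = string.ascii_lowercase):
    """Yield every three-character wildcard query in alphabetical order."""
    for chars in product(alphabet, repeat=3):
        yield "".join(chars) + "*"


queries = list(three_char_prefixes())
print(len(queries), queries[0], queries[1], queries[-1])
```

With 26 letters this gives 26^3 = 17,576 queries, so the request count is fixed up front regardless of how many results each prefix returns.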
22:36
π
|
|
Stilett0- is now known as Stiletto |
23:09
π
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
23:56
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
23:57
π
|
|
dashcloud has joined #archiveteam-bs |