| Time |
Nickname |
Message |
|
00:27
π
|
|
kristian_ has quit IRC (Quit: Leaving) |
|
00:44
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
00:55
π
|
|
dashcloud has joined #archiveteam-bs |
|
00:56
π
|
|
Soni has quit IRC (Ping timeout: 272 seconds) |
|
00:58
π
|
|
Soni has joined #archiveteam-bs |
|
02:11
π
|
|
balrog has quit IRC (Read error: Operation timed out) |
|
02:13
π
|
|
balrog has joined #archiveteam-bs |
|
02:13
π
|
|
swebb sets mode: +o balrog |
|
02:23
π
|
|
ftpdingus has joined #archiveteam-bs |
|
02:23
π
|
ftpdingus |
Anyone alive? |
|
02:43
π
|
|
ZexaronS- has quit IRC (Read error: Operation timed out) |
|
02:59
π
|
|
ZexaronS has joined #archiveteam-bs |
|
03:18
π
|
|
ZexaronS has quit IRC (Read error: Operation timed out) |
|
03:45
π
|
|
ZexaronS has joined #archiveteam-bs |
|
03:55
π
|
|
godane has quit IRC (Read error: Operation timed out) |
|
04:27
π
|
ndiddy |
yeah |
|
04:27
π
|
|
ftpdingus has quit IRC (Read error: Connection reset by peer) |
|
04:27
π
|
ndiddy |
oh |
|
04:28
π
|
|
ftpdingus has joined #archiveteam-bs |
|
04:37
π
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
|
04:40
π
|
Somebody2 |
oh? |
|
04:44
π
|
|
Sk1d has joined #archiveteam-bs |
|
04:52
π
|
astrid |
oh. |
|
05:29
π
|
|
divingkat has joined #archiveteam-bs |
|
05:30
π
|
divingkat |
http://miscdatadigs.shoutwiki.com/ |
|
05:30
π
|
divingkat |
The TCRF of regular everyday computer programs |
|
05:30
π
|
divingkat |
Or something. |
|
05:30
π
|
divingkat |
Basically digging through programs, apps, etc. to find interesting things in them. |
|
05:31
π
|
divingkat |
Or at least, uncommon knowledge. |
|
06:06
π
|
|
BlueMaxim has joined #archiveteam-bs |
|
06:12
π
|
|
ftpdingus has quit IRC (Ping timeout: 255 seconds) |
|
07:03
π
|
|
mls has quit IRC (Ping timeout: 250 seconds) |
|
07:11
π
|
|
mls has joined #archiveteam-bs |
|
08:04
π
|
|
divingkat has quit IRC (Leaving) |
|
10:33
π
|
|
Soni has quit IRC (Ping timeout: 272 seconds) |
|
10:33
π
|
|
Soni has joined #archiveteam-bs |
|
11:12
π
|
JAA |
Whoa. I managed to trigger an infinite loop in wpull. It tried to retrieve the same broken URL (NXDOMAIN) over 6000 times although I specified --tries 3. |
|
12:00
π
|
ranma |
https://board.byuu.org/viewtopic.php?f=6&t=1804 |
|
12:26
π
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
|
13:16
π
|
|
midas_ has joined #archiveteam-bs |
|
13:19
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
|
13:19
π
|
|
zhongfu has joined #archiveteam-bs |
|
14:04
π
|
|
kristian_ has joined #archiveteam-bs |
|
14:34
π
|
|
brayden_ has joined #archiveteam-bs |
|
14:34
π
|
|
swebb sets mode: +o brayden_ |
|
14:37
π
|
|
brayden has quit IRC (Ping timeout: 255 seconds) |
|
14:37
π
|
|
brayden_ is now known as brayden |
|
14:38
π
|
|
brayden has left Closing Window |
|
14:38
π
|
|
brayden has joined #archiveteam-bs |
|
14:38
π
|
|
swebb sets mode: +o brayden |
|
14:44
π
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
|
15:25
π
|
|
kristian_ has quit IRC (Quit: Leaving) |
|
15:27
π
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
|
15:41
π
|
|
robink has joined #archiveteam-bs |
|
15:41
π
|
|
robinak has quit IRC (Read error: Connection reset by peer) |
|
16:20
π
|
|
godane has joined #archiveteam-bs |
|
16:20
π
|
godane |
so i had to find the irc.efnet.org ip address to connect to there |
|
16:21
π
|
godane |
*here |
|
16:22
π
|
godane |
i kept on getting a not resolve error in pidgin |
|
17:11
π
|
|
schbirid has joined #archiveteam-bs |
|
17:13
π
|
JAA |
Can confirm, irc.efnet.org does not resolve for me currently. Looks like EFNet's name servers are down. |
|
17:16
π
|
godane |
ok then |
|
17:16
π
|
godane |
at least its not comcast maybe |
|
17:17
π
|
JAA |
Definitely not, unless the EFNet admins are suicidal and host their name servers at Comcast. |
|
17:37
π
|
JAA |
So, regarding Dead Format: I'm planning on creating a seed list and letting ArchiveBot take it from there. |
|
17:37
π
|
JAA |
Artists, labels, and users are all not enumerable as far as I can see. Releases are, however, although there is also a slug in the URL. So I've grabbed all releases (IDs 1 to 23k) and will extract the artist, label, and user links from those for seeding. This will, however, only cover a tiny part of the website. |
|
17:37
π
|
JAA |
My second idea was search engines, but DDG and Searx didn't return many results. |
|
17:38
π
|
JAA |
Another plan would be bruteforcing the collection search, which appears to be basically the only way to really discover content on that site. |
|
17:38
π
|
|
schbirid2 has joined #archiveteam-bs |
|
17:39
π
|
astrid |
<3 |
|
17:42
π
|
JAA |
I found a list of the 10k most common English words, that sounds like a good start. I'll also try to find some lists in other important languages. |
|
17:43
π
|
JAA |
If anyone has other ideas, let me know. |
|
17:43
π
|
JAA |
(Or such word lists.) |
|
17:44
π
|
JAA |
It looks like their search fails when there are too many results. You can't search for "the", for example. |
|
17:45
π
|
|
schbirid has quit IRC (Read error: Operation timed out) |
|
17:46
π
|
joepie91 |
JAA: try wildcards |
|
17:46
π
|
joepie91 |
also |
|
17:47
π
|
joepie91 |
sec |
|
17:47
π
|
JAA |
Ooh, nice one. "Search for t* in Haves returned 544,902 items" |
|
17:48
π
|
JAA |
Unfortunately, this is incredibly slow, i.e. causes very high load on their server. |
|
17:48
π
|
JAA |
But it's a very good idea. |
|
17:52
π
|
joepie91 |
JAA: will have more for you in a few mins |
|
17:52
π
|
joepie91 |
hold |
|
17:52
π
|
JAA |
Huh, the search is funny. Searching "entrée" or "entrèe" both returns the same result, but "entree" doesn't. |
|
17:52
π
|
JAA |
Ok |
|
18:16
π
|
JAA |
So my releases grab yielded 8355 artists, 3743 labels, and 1624 users. |
|
18:24
π
|
JAA |
Note for myself: the ArchiveBot job will have to follow external links to grab the full-resolution images. |
|
18:26
π
|
joepie91 |
JAA: https://github.com/joepie91/webshots |
|
18:26
π
|
joepie91 |
JAA: in particular https://github.com/joepie91/webshots/blob/master/search.py |
|
18:26
π
|
joepie91 |
note the last commit description |
|
18:26
π
|
joepie91 |
JAA: does wildcard searches that are automatically refined when it detects an incomplete result set, and un-refines afterwards |
|
18:27
π
|
joepie91 |
you'd need to change the exact parameter for determining 'incomplete result set' (you need 'too many results error' instead of 'X amount of results') but otherwise the idea is the same |
|
18:28
π
|
joepie91 |
JAA: tl;dr obtains a complete set with the minimum amount of search requests |
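The refinement idea described above can be sketched roughly as follows. This is not the code from the linked repository. The `search` callback, the alphabet, and the depth limit are all hypothetical, and the sketch assumes the site signals "too many results" in a way the callback can turn into `None`.

```python
import string
from typing import Callable, Iterable, List, Optional, Set, Tuple

# Assumed character set for extending a prefix; the real site may accept more.
ALPHABET = string.ascii_lowercase + string.digits


def crawl_prefixes(
    search: Callable[[str], Optional[Set[str]]],
    prefixes: Iterable[str],
    max_depth: int = 6,
) -> Tuple[Set[str], List[str]]:
    """Collect results for wildcard queries, refining only when necessary.

    `search(query)` is a hypothetical callback that returns a set of results,
    or None when the site refuses the query for having too many results.
    Returns the collected results and the prefixes that still overflowed
    at `max_depth`.
    """
    found: Set[str] = set()
    unresolved: List[str] = []
    stack = list(prefixes)
    while stack:
        prefix = stack.pop()
        results = search(prefix + "*")
        if results is not None:
            # Complete result set: no need to go deeper for this prefix.
            found.update(results)
        elif len(prefix) < max_depth:
            # Too many results: split the prefix into longer ones.
            stack.extend(prefix + ch for ch in ALPHABET)
        else:
            # Give up on this branch but report it to the caller.
            unresolved.append(prefix)
    return found, unresolved
```

One caveat of this sketch: when a prefix is refined, entries that equal the prefix exactly, or that continue with a character outside `ALPHABET`, are no longer covered by any of the longer queries.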
|
18:28
π
|
JAA |
Oh, neat. |
|
18:28
π
|
JAA |
Thanks! I have to leave now, but I'll take a closer look later. |
|
18:28
π
|
JAA |
This sounds perfect. |
|
18:29
π
|
joepie91 |
JAA: it was, incidentally, for an older archiveteam project :) |
|
18:35
π
|
godane |
SketchCow: here are some items are having trouble : https://pastebin.com/W6mkfij5 |
|
18:35
π
|
godane |
at least 2 of the items history i have check don't say its dark |
|
18:35
π
|
godane |
*history log |
|
18:36
π
|
|
bwn has quit IRC (Read error: Operation timed out) |
|
19:19
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
19:20
π
|
|
dashcloud has joined #archiveteam-bs |
|
19:24
π
|
|
kristian_ has joined #archiveteam-bs |
|
19:28
π
|
JAA |
joepie91: How surprising. :-P |
|
19:32
π
|
|
kristian_ has quit IRC (Ping timeout: 370 seconds) |
|
19:32
π
|
|
kristian_ has joined #archiveteam-bs |
|
19:38
π
|
|
Stilett0- has joined #archiveteam-bs |
|
19:38
π
|
|
ruunyan has quit IRC (Ping timeout: 250 seconds) |
|
19:44
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
19:47
π
|
JAA |
So, regarding the entrée/entrèe situation: the search completely ignores the special characters, splits it into two words, and probably ignores "e" as a search word. "entr" obviously matches the entry which contains "entrèe". |
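The guess above can be written down as a small model. This is only a sketch of the hypothesis, not the site's actual search code. The function name, the ASCII-only character class, and the minimum term length of 2 are assumptions made for illustration.

```python
import re
import unicodedata
from typing import List


def guess_search_terms(query: str, min_len: int = 2) -> List[str]:
    """Model of the hypothesis: non-ASCII characters act as word separators
    and very short words are dropped. `min_len` is an assumed cutoff."""
    # Use composed characters so that an accented letter is a single code point.
    normalized = unicodedata.normalize("NFC", query).lower()
    # Everything outside a-z and 0-9 splits the query into separate words.
    parts = re.split(r"[^a-z0-9]+", normalized)
    # Drop empty strings and words shorter than the assumed cutoff.
    return [part for part in parts if len(part) >= min_len]
```

Under these assumptions, "entrée" and "entrèe" both reduce to the single term "entr", while "entree" stays "entree", which would match the behavior observed at 17:52.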
|
19:48
π
|
|
dashcloud has joined #archiveteam-bs |
|
19:49
π
|
JAA |
Interestingly, you can't search for "esp", but "espé" works (but is not the same as "esp*"). |
|
20:26
π
|
|
kristian_ has quit IRC (Read error: Operation timed out) |
|
20:33
π
|
|
kristian_ has joined #archiveteam-bs |
|
20:36
π
|
|
kristian_ has quit IRC (Client Quit) |
|
20:42
π
|
JAA |
Huh, I found a query which returns over one third of all "items" (= collection entries, as far as I can tell). I probably don't have to mention how slow it is. |
|
20:43
π
|
JAA |
I don't have the slightest idea why it works though. I entered "*a-b*". |
|
21:16
π
|
|
schbirid2 has quit IRC (Quit: Leaving) |
|
21:20
π
|
|
noirscape has quit IRC (Read error: Connection reset by peer) |
|
22:11
π
|
joepie91 |
lol |
|
22:18
π
|
|
bwn has joined #archiveteam-bs |
|
22:28
π
|
JAA |
My scrape is running now. I'm simply bruteforcing all three-character prefixes, i.e. aaa*, aab*, ... zzz*. |
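The query list for such a bruteforce could be generated along these lines. This is a minimal sketch, not JAA's actual code. The function name is made up, and restricting the alphabet to lowercase a to z is an assumption taken from the "aaa*, aab*, ... zzz*" description.

```python
import itertools
import string
from typing import Iterator


def three_char_queries(alphabet: str = string.ascii_lowercase) -> Iterator[str]:
    """Yield every three-character prefix followed by a wildcard, in order."""
    for combo in itertools.product(alphabet, repeat=3):
        yield "".join(combo) + "*"
```

With the 26-letter default this yields 26³ = 17,576 queries, which is the fixed cost of this approach regardless of how many results each prefix returns.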
|
22:29
π
|
JAA |
I ended up writing my own code since the specifics were quite a bit different than in your case and I figured I'd be faster this way. |
|
22:36
π
|
|
Stilett0- is now known as Stiletto |
|
23:09
π
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
|
23:56
π
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
|
23:57
π
|
|
dashcloud has joined #archiveteam-bs |