Time |
Nickname |
Message |
00:01
🔗
|
JAA |
icedice: So uh, the scrape of the first 1000 pages for site:spit.mixtape.moe returned 196 URLs... :-| |
00:02
🔗
|
icedice |
:/ |
00:02
🔗
|
Flashfire |
its a start |
00:02
🔗
|
icedice |
Yeah |
00:02
🔗
|
Flashfire |
Could we run it through URLteam? |
00:02
🔗
|
Flashfire |
Then use the results for a scrape? |
00:02
🔗
|
JAA |
Flashfire: No, codes are too long. See the math yesterday. |
00:02
🔗
|
Flashfire |
oh ok |
00:03
🔗
|
JAA |
Oh you mean scrape Bing through URLTeam? Hell no. :-P |
00:03
🔗
|
JAA |
It really just shows how much Bing sucks. |
00:04
🔗
|
JAA |
That said: out of the 55k mixtape.moe URLs Fusl gave me, only 308 are on spit. |
00:05
🔗
|
JAA |
So it seems like there really are few pastes on there. |
00:14
🔗
|
|
benjinsmi has quit IRC (Leaving) |
00:18
🔗
|
godane |
so this guy is posting on my Nightline reviews: https://archive.org/details/@peterfranz_0&tab=reviews |
00:21
🔗
|
godane |
like one comment said: Take those goddamn motherfucking Nightline episodes from 6/2/2004 to 12/30/2005 and fucking post them onto www.archive.org right this fucking minute. |
00:21
🔗
|
Flashfire |
Blimey he doesnt seem happy |
00:22
🔗
|
godane |
first review here was not that bad: https://archive.org/details/ABC-Nightline-2004-06-01 |
00:23
🔗
|
godane |
so like after 5 days he lost it or something |
00:26
🔗
|
icedice |
JAA: Any plans on crawling https://archive.nyafuu.org/_/search/text/mixtape.moe/ https://rbt.asia/_/search/text/mixtape.moe/ https://archive.4plebs.org/_/search/text/mixtape.moe/ and https://warosu.org/ ? |
00:27
🔗
|
astrid |
godane: wow .. |
00:35
🔗
|
|
bithippo has joined #archiveteam-bs |
00:38
🔗
|
icedice |
Also: https://yuki.la/search.html?boards=all&startdate=2008-02-02&enddate=2019-03-25&sort=descending&postext=mixtape.moe#page=1 |
00:45
🔗
|
godane |
astrid: i just find it weird to go from being nice to dropping f bombs at the person you want to continue uploading stuff like it |
00:45
🔗
|
astrid |
right?? |
00:50
🔗
|
|
benjins has joined #archiveteam-bs |
00:50
🔗
|
JAA |
icedice: Scraping the first three now, though there'll probably be a lot of overlap with archived.moe and desuarchive.org. I don't see any search on Warosu. |
00:50
🔗
|
SketchCow |
I've deleted all his reviews. |
00:51
🔗
|
JAA |
icedice: archive.4plebs.org has a very strict rate limit on the search, so skipping that. |
00:51
🔗
|
icedice |
https://thebarchive.com/_/search/text/%22mixtape.moe%22/ |
00:52
🔗
|
icedice |
^ Found another one that isn't on the wiki |
00:53
🔗
|
icedice |
thebarchive.com + randomarchive.com + yuki.la are not on the list of 4chan archivers |
00:53
🔗
|
icedice |
Warosu has a search feature, but it's per board |
00:54
🔗
|
icedice |
Yeah, I think it will be a lot of the same links |
00:54
🔗
|
JAA |
rbt.asia's search is really slow. |
00:54
🔗
|
icedice |
Some might have started archiving a board earlier than another one and some might have had outages, and so on |
00:55
🔗
|
icedice |
So that's one reason to double check using multiple archivers |
00:55
🔗
|
JAA |
541 additional links from archive.nyafuu.org. |
00:55
🔗
|
icedice |
Not bad |
00:59
🔗
|
JAA |
This is getting really messy. I won't be adding anything further for now. |
01:00
🔗
|
JAA |
The test job seems to have completed successfully, by the way. |
01:01
🔗
|
JAA |
Bing site:my.mixtape.moe returned 3 results from 100 pages....... Sigh. |
01:14
🔗
|
icedice |
Ok |
01:15
🔗
|
icedice |
I'll try to get the Mixtape.moe links from those forums manually next week |
01:15
🔗
|
JAA |
Current URL count: 82951 |
01:15
🔗
|
icedice |
Not sure I can do much else to help |
01:18
🔗
|
JAA |
82950*, there was a BOM in a duplicated URL. |
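A stray byte order mark makes two otherwise identical URLs compare as different, which is how a duplicate can slip through a merge of several lists. A minimal sketch of a BOM-aware dedupe follows; the function names and the example.com URLs used to exercise it are illustrative assumptions, not JAA's actual tooling.

```python
BOM = "\ufeff"  # U+FEFF; note that str.strip() does not remove it


def normalize_url(line: str) -> str:
    """Remove any byte order mark, then surrounding whitespace, from one line."""
    return line.replace(BOM, "").strip()


def dedupe_urls(lines):
    """Return the unique, non-empty URLs in first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        url = normalize_url(line)
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Opening the list file with `encoding="utf-8-sig"` only drops a BOM at the very start of the file, so a BOM carried into the middle of a concatenated list still needs the explicit replace.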
01:18
🔗
|
icedice |
https://searx.me/?preferences=eJx1VcGO2zgM_Zr6Ysyi3R725MOiRbEFFpiime5VoCXa4VoWXVFOxv36Uhk7VjrpIQ6skO89PopMRJl9EsPBBDybBG3zCbxg5ZhMRGF_wtgw6OsfHPvKQ-hn6LHB8PDtUHm24LGpHAm0Hp2Z_NxTkOY_Gh88DWiOnAZc5M2fHx4nDBlS-YynMFxJ28hnwZjJc9zn0FGghEZsZO-3zL-tRRHz8fGzgpyjBlQ0qhIzRX5eVtUwJ7Y8Th6TqhLoUBCiPTZvq3TEERsWC7HCcKv2gL4zystxhEQcsoynCHZQVd--_quEI6sPevrP09OXwyZA3w8X-FyJhYQ9x8UIerRptwSDcqA0A9kBtALTkcfM0EXEWrhLZ4hYO4qalgEMJf31xJCMEbYEvh7REeghBTDmRA45A0R0jl4HzUEm |
01:18
🔗
|
icedice |
D3JUoGxQjuyZe4-1Hi81TFMh4uPs1FzTY8AI2Wxr7UM6FSw99VoISCqjNmrt36UUF5ncjhoWgILezXbIn55LjKshO1Uhc5yFrDGXL_1pgeDw-Z6E_aRHHBKNKKuFLyo2BEd9mwoysL3w_Dt2PhGW0t49F_UIoptQL8cVfOT_lbwUs8Ffk1hvccSJC_N3nFc93b17kb2loA5Ldnon6iAyXztxhDZCfqwe5OtJJ6xJ7onb5LeUWu0PpjVrgSPf9KqLOrEEthCWeFg4sRx5gLBDrXL2VEzLyEEnHsvTl95cy-4ijOCpjbgqGJdRhy4udYoQxOtwuXsFbLbkVNDoHCG3tX4_Q7i5JvttrF-KudvZ26u8vf1S9FX_2qXNhF_C9n7_CDCWWuRMIlb3X6nBQoi1pDjbNMcsTyxhsHhByA217LDOj9WsAmWHfvf-_V_PN3OcPL |
01:18
🔗
|
icedice |
RbRoKYprw-S2NcXzu8bN-8BO_O6qXC6wLddls1Yjqya748Hp6qdRPq_DQrQHVZuw-SFv2z8NxT1n76CXM1cVc=&q="mixtape.moe"+site:steamcommunity.com |
01:18
🔗
|
JAA |
What a URL. |
01:21
🔗
|
icedice |
Yeah |
01:21
🔗
|
icedice |
https://tiny.cc/searxforarchivation |
01:22
🔗
|
icedice |
^ I selected every supported meta search engine in the configuration and deselected everything else (image search, dictionary search, and so on) |
01:23
🔗
|
icedice |
Is that usable for crawling? |
01:25
🔗
|
JAA |
I've tried to scrape searx.me before, and that didn't work well. Rate limits and other issues I don't remember right now. |
01:25
🔗
|
JAA |
It'd be better to run our own Searx instance instead. |
01:28
🔗
|
icedice |
Yeah |
01:28
🔗
|
icedice |
There is a list of searx instances here btw: https://github.com/asciimoo/searx/wiki/Searx-instances |
01:33
🔗
|
icedice |
Is the site design and formatting of searx scrapeable though, if we got a searx instance with decent limits? |
01:33
🔗
|
JAA |
Don't even need to scrape since it has JSON output as well. |
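A rough sketch of what using the JSON output could look like, assuming a self-hosted instance that has the JSON format enabled. The `format=json` and `pageno` query parameters and the `results`/`url` keys follow the Searx search API as I understand it; the instance address `https://searx.example.org` and both function names are placeholders.

```python
import json
from urllib.parse import urlencode


def build_search_url(instance: str, query: str, page: int = 1) -> str:
    """Build a Searx search URL that asks for JSON instead of HTML."""
    params = {"q": query, "format": "json", "pageno": page}
    return instance.rstrip("/") + "/search?" + urlencode(params)


def extract_urls(response_text: str):
    """Pull the result URLs out of one JSON response body."""
    data = json.loads(response_text)
    return [r["url"] for r in data.get("results", []) if "url" in r]
```

Fetching is left out on purpose: any HTTP client can request `build_search_url("https://searx.example.org", '"mixtape.moe" site:steamcommunity.com')` and pass the body to `extract_urls`, paging until a response comes back with no results.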
01:34
🔗
|
icedice |
Ah, nice |
01:48
🔗
|
|
bithippo has quit IRC (Textual IRC Client: www.textualapp.com) |
01:49
🔗
|
|
BlueMax has quit IRC (Quit: Leaving) |
01:54
🔗
|
hook54321 |
VoynichCr, JAA: How should we add sites to the list that have all of their subdomains excluded from the Wayback Machine? |
02:04
🔗
|
Flashfire |
To which list hook54321 |
02:11
🔗
|
hook54321 |
https://www.archiveteam.org/index.php?title=List_of_websites_excluded_from_the_Wayback_Machine |
02:13
🔗
|
Flashfire |
Thats actually not a bot updated one |
02:14
🔗
|
|
balrog has quit IRC (Quit: Bye) |
02:14
🔗
|
hook54321 |
i know |
02:14
🔗
|
Flashfire |
I created that one initially, hook54321. I just add each of the subdomains individually myself, but it's personal preference |
02:16
🔗
|
Flashfire |
Who is Amerepheasant |
02:16
🔗
|
|
balrog has joined #archiveteam-bs |
02:39
🔗
|
|
ndiddy has quit IRC (Ping timeout: 268 seconds) |
02:46
🔗
|
|
BlueMax has joined #archiveteam-bs |
02:48
🔗
|
|
wyatt8740 has joined #archiveteam-bs |
03:02
🔗
|
|
Panasonic has joined #archiveteam-bs |
03:28
🔗
|
godane |
dashcloud: so ffmpeg got the audio issue again |
03:29
🔗
|
godane |
SketchCow: i may need to reboot so don't touch whats in the Godane VHS Capture folder |
03:30
🔗
|
godane |
anyways there are 2 theories |
03:30
🔗
|
godane |
one is don't upload when capturing to same hd drive |
03:31
🔗
|
godane |
2nd theory is that the zram swap is too full |
03:32
🔗
|
|
godane has quit IRC (Quit: Leaving.) |
03:40
🔗
|
|
godane has joined #archiveteam-bs |
03:43
🔗
|
Lord_Nigh |
ftp://89.179.20.136/oldftp/TOOLS/SOFTWARE/Lang/ASEMBLER/ i suspect there's some more stuff from the lost motorola/freescale bbs lurking in there |
03:43
🔗
|
Lord_Nigh |
its sad that no complete mirror of that ftp site was ever made before it went defunct, or even a directory listing that i'm aware of |
03:44
🔗
|
Lord_Nigh |
given the bits of it spread all over old ftp sites on the net i'm guessing more than half of it could be reconstructed |
03:44
🔗
|
Flashfire |
Lord_Nigh many of them are on the FTP list. I have kind of abandoned it with no work going into it on the coding end and me having other duties to fulfill |
03:45
🔗
|
Flashfire |
But there is a fairly large list on the FTP/List now |
03:45
🔗
|
Lord_Nigh |
so the ftp project is officially dead, or just on hiatus? |
03:45
🔗
|
Flashfire |
https://www.archiveteam.org/index.php?title=FTP/List |
03:45
🔗
|
Flashfire |
Hiatus I hope |
03:45
🔗
|
Lord_Nigh |
when i find ftp sites which have a working http gateway sometimes i feed them to archivebot |
03:45
🔗
|
Lord_Nigh |
this one does not |
03:46
🔗
|
Flashfire |
I have done a lot of work adding to the FTP/List but as far as I know the code is outdated |
03:46
🔗
|
Flashfire |
Archivebot supports FTP and makes them into WARCs, but I don't believe FTP is incorporated into the Wayback Machine |
03:46
🔗
|
Flashfire |
I dont have the coding knowledge or the time to gain that coding knowledge to check if the code for grabbing FTP is currently up to date |
03:47
🔗
|
Flashfire |
#effteepee is our channel for that project if you want to join me there for the discussion but its pretty dead looking |
03:48
🔗
|
Flashfire |
Lord_Nigh join that channel to discuss further |
04:28
🔗
|
|
qw3rty112 has joined #archiveteam-bs |
04:34
🔗
|
|
qw3rty111 has quit IRC (Read error: Operation timed out) |
04:35
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 255 seconds) |
04:36
🔗
|
|
Mateon1 has joined #archiveteam-bs |
04:42
🔗
|
|
odemgi has joined #archiveteam-bs |
04:44
🔗
|
|
odemgi_ has quit IRC (Ping timeout: 252 seconds) |
04:50
🔗
|
|
odemg has quit IRC (Ping timeout: 615 seconds) |
04:57
🔗
|
|
odemg has joined #archiveteam-bs |
05:05
🔗
|
|
Despatche has quit IRC (Read error: Operation timed out) |
05:48
🔗
|
|
dhyan_nat has joined #archiveteam-bs |
05:48
🔗
|
|
Despatche has joined #archiveteam-bs |
06:23
🔗
|
marked |
Who's yipdw? I have questions about a tracker branch by them |
06:43
🔗
|
|
wp494 has quit IRC (Ping timeout: 506 seconds) |
06:43
🔗
|
|
wp494 has joined #archiveteam-bs |
06:45
🔗
|
|
killsushi has quit IRC (Quit: Leaving) |
06:58
🔗
|
|
Exairnous has quit IRC (Read error: Operation timed out) |
06:58
🔗
|
|
Exairnous has joined #archiveteam-bs |
07:14
🔗
|
hook54321 |
marked: they're not active right now. someone else might be able to answer them though. |
08:25
🔗
|
marked |
K, thanks. There's a branch called "Teach tracker how to value some upload locations more enhancement" any knowledge why this code was abandoned? https://github.com/ArchiveTeam/universal-tracker/pull/20 |
08:47
🔗
|
|
dhyan_nat has quit IRC (Read error: Connection reset by peer) |
08:47
🔗
|
|
dhyan_nat has joined #archiveteam-bs |
09:41
🔗
|
|
MrRadar_ has joined #archiveteam-bs |
09:44
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
10:20
🔗
|
|
Despatche has quit IRC (Quit: Read error: Connection reset by deer) |
10:41
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
10:57
🔗
|
godane |
dashcloud: the audio sync issue is back |
10:58
🔗
|
godane |
part of me thinks it was my fault, because i was messing with firefox debug mode during capturing |
11:21
🔗
|
|
Panasonic has quit IRC (Remote host closed the connection) |
12:33
🔗
|
JAA |
hook54321: I think for readability it might be a good idea to list all subdomains on one line. Start with the whole domain, then list everything else afterwards. My bot preserves the lines as they are and sorts them by the first URL it finds. |
12:33
🔗
|
JAA |
Otherwise the subdomains are scattered throughout the whole list and it gets messy. |
12:34
🔗
|
JAA |
Depends a bit on the context though of course. If it's a free web hoster type thing with usernames as subdomains, it might make more sense to have individual entries. |
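To illustrate why putting the whole domain first keeps a site's entries together, here is a small sketch of sorting list lines by the first URL found on each line while leaving the lines themselves untouched. This only mirrors the behaviour described above; it is not the bot's actual code, and the regex, the function names, and the choice to ignore the scheme and case are assumptions.

```python
import re

# Capture everything after the scheme, up to the next whitespace.
FIRST_URL_RE = re.compile(r"https?://(\S+)")


def first_url_key(line: str):
    """Sort key: the first URL on the line, minus scheme, case-insensitive."""
    match = FIRST_URL_RE.search(line)
    if match:
        return (0, match.group(1).lower())
    return (1, line)  # lines without any URL go to the end


def sort_by_first_url(lines):
    """Return the lines sorted by their first URL, contents unchanged."""
    return sorted(lines, key=first_url_key)
```

With this ordering, a line such as `* https://example.org/ https://a.example.org/` is filed under `example.org`, so its subdomains travel with it instead of being scattered through the list.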
13:22
🔗
|
Jopik |
hello, does anyone know by chance who wrote this static json index search thingy? https://archive.org/download/webshots-freeze-frame-index/index.html Is there any code that generates the index files? I want to use this for the blogspot google+ comment database to find the exact file for a domain |
13:23
🔗
|
JAA |
Would be very useful for Tumblr as well since there have been various people asking how to find a particular blog in the archives. |
13:24
🔗
|
Jopik |
I tried to look for any source or docs on github and the wider web and couldn't find anything, so it seems it's original work done by someone on archiveteam. |
13:40
🔗
|
|
Dj-Wawa has joined #archiveteam-bs |
14:48
🔗
|
|
MrRadar_ is now known as MrRadar |
14:48
🔗
|
|
MrRadar2 sets mode: +o MrRadar |
14:50
🔗
|
marked |
MrRadar : is there somewhere I can read about what you found the rsync error to be? |
14:50
🔗
|
MrRadar |
Take a look at the pull request and commit messages |
14:51
🔗
|
MrRadar |
https://github.com/ArchiveTeam/seesaw-kit/pull/113 |
14:52
🔗
|
MrRadar |
Basically I just looked at the exceptions, figured out what was causing them and also what effect they had on the control flow |
14:54
🔗
|
MrRadar |
tl;dr: Some system calls can randomly fail due to being interrupted by signals, which can get generated by child processes |
14:54
🔗
|
MrRadar |
That was causing an exception that would cause the item to fail |
14:54
🔗
|
MrRadar |
Except the wrong task was being notified of the failure |
14:54
🔗
|
MrRadar |
Resulting in the item not actually failing |
14:57
🔗
|
marked |
Ok, thanks. I understand that better than the code changes. Man, I wish I realized this was from 2013 instead of just started on google+ before I started down that rabbit hole. |
14:58
🔗
|
MrRadar |
So now it catches those exceptions and retries the operation, notifies the correct task of failures, and doesn't fail external process tasks for internal Python tasks |
14:58
🔗
|
MrRadar |
*internal python errors |
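The retry half of that fix can be sketched roughly as below. This is a generic illustration of retrying a call that was interrupted by a signal, not the code from the seesaw-kit pull request; `retry_on_eintr` and `max_attempts` are made-up names. On Python 3.5 and later many standard library calls already retry on EINTR by themselves (PEP 475), so a wrapper like this matters mainly on older interpreters.

```python
import errno


def retry_on_eintr(func, *args, max_attempts=5, **kwargs):
    """Call func, retrying only when it fails with EINTR.

    Any other error, or EINTR on the final attempt, is re-raised
    so the caller can fail the right task.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            if exc.errno != errno.EINTR or attempt == max_attempts:
                raise
            # Interrupted by a signal (for example SIGCHLD from a
            # finished child process): just try the call again.
```

A caller would wrap the individual system call, for example `retry_on_eintr(os.read, fd, 4096)`, rather than a whole task, so a spurious interruption never reaches the item's failure handling.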
14:58
🔗
|
marked |
If the wrong thread was notified, is there a chance an item would get confirmed but the rsync had failed? |
14:59
🔗
|
MrRadar |
It wasn't obvious what was going on at first but I am surprised nobody else took the time to dig into it |
15:00
🔗
|
MrRadar2 |
I actually did observe that behavior: the item would "fail" due to this spurious exception but the upload would continue and it would even get confirmed with the tracker, however the upload would then be retried over and over |
15:03
🔗
|
marked |
could that also explain duplicate confirms to the tracker? |
15:03
🔗
|
MrRadar2 |
Possibly |
15:06
🔗
|
marked |
--concurrent 1 would presumably never have a misrouted fail? |
15:06
🔗
|
MrRadar2 |
It still could (EINTR exceptions can be caused by any signal), though it would be much less likely (since you are only dealing with 1 child process at a time instead of up to 20) |
15:10
🔗
|
marked |
that matches my experimental data, that I've seen it at level 1 before, but was hard to get it to happen again |
15:11
🔗
|
|
VerifiedJ has quit IRC (Read error: Connection reset by peer) |
15:12
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
15:30
🔗
|
|
ndiddy has joined #archiveteam-bs |
15:33
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
15:36
🔗
|
|
dhyan_nat has joined #archiveteam-bs |
15:45
🔗
|
|
wp494 has quit IRC (Ping timeout: 615 seconds) |
15:45
🔗
|
|
wp494 has joined #archiveteam-bs |
16:10
🔗
|
|
Dj-Wawa has quit IRC (Quit: Connection closed for inactivity) |
16:34
🔗
|
|
Stilett0- has joined #archiveteam-bs |
16:37
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
18:01
🔗
|
|
gogondwan has quit IRC (Ping timeout: 260 seconds) |
18:22
🔗
|
|
Asparagir has joined #archiveteam-bs |
19:00
🔗
|
|
VerifiedJ has quit IRC (Read error: Connection reset by peer) |
19:01
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
19:18
🔗
|
|
Exairnous has quit IRC (Ping timeout: 252 seconds) |
19:26
🔗
|
|
Exairnous has joined #archiveteam-bs |
19:50
🔗
|
|
Asparagir has quit IRC (Asparagir) |
19:54
🔗
|
|
VerifiedJ has quit IRC (Read error: Connection reset by peer) |
19:54
🔗
|
|
VerifiedJ has joined #archiveteam-bs |
20:09
🔗
|
|
Exairnous has quit IRC (Read error: Connection reset by peer) |
20:14
🔗
|
|
Exairnous has joined #archiveteam-bs |
20:15
🔗
|
|
overflowe has quit IRC (Remote host closed the connection) |
20:20
🔗
|
|
lindalap has quit IRC (Quit: lindalap) |
20:21
🔗
|
|
Exairnous has quit IRC (Ping timeout: 268 seconds) |
20:27
🔗
|
|
overflowe has joined #archiveteam-bs |
20:41
🔗
|
hook54321 |
JAA: k, I'll do that when adding subdomains. |
20:41
🔗
|
|
Despatche has joined #archiveteam-bs |
20:42
🔗
|
hook54321 |
Also, we should try to find a way to take a large domain list and just go through them automatically in bulk to get a list of ones that are excluded |
20:49
🔗
|
|
alex___ has joined #archiveteam-bs |
21:59
🔗
|
|
dhyan_nat has quit IRC (Read error: Operation timed out) |
22:06
🔗
|
|
Stilett0- is now known as Stiletto |
22:20
🔗
|
|
BlueMax has joined #archiveteam-bs |
22:24
🔗
|
|
schbirid has quit IRC (Remote host closed the connection) |
22:50
🔗
|
|
PhrackD- has joined #archiveteam-bs |
22:50
🔗
|
|
PhrackD has quit IRC (Read error: Operation timed out) |
22:50
🔗
|
|
PhrackD- is now known as PhrackD |
23:02
🔗
|
|
alex___ has quit IRC (Quit: take care ye all. Have fun!) |